This repository contains a TensorFlow implementation of X3D: Expanding Architectures for Efficient Video Recognition. X3D networks are derived by progressively expanding multiple axes of a tiny 2D image classification network (temporal duration, frame rate, spatial resolution, width, bottleneck width, and depth) using a stepwise network expansion method. This yields a family of models with a good accuracy-to-complexity trade-off on video classification tasks.
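As a rough intuition for what expansion means, each axis of a small base configuration is scaled by a per-axis factor. The sketch below is purely illustrative; the base values and factors are made up and do not reproduce the paper's actual expansion schedule.

```python
# Toy illustration of axis-wise expansion (values are hypothetical, not the
# paper's schedule): each axis of a tiny base network is scaled by its own factor.
base = {"frames": 1, "resolution": 112, "width": 24, "depth": 10}
factors = {"frames": 4.0, "resolution": 2.0, "width": 2.0, "depth": 2.2}

expanded = {axis: int(round(base[axis] * factors[axis])) for axis in base}
print(expanded)  # {'frames': 4, 'resolution': 224, 'width': 48, 'depth': 22}
```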
```bash
# Optional: create a conda environment first
conda create --name x3d-tf tensorflow-gpu
pip install -r requirements.txt
```
The data preparation options provided here were developed and tested on the Kinetics-400 dataset, which can be downloaded using this repo. Both options expect the following folder/file structure:
```
-- class_name_1
     -- video_1.mp4
     -- video_2.mkv
-- class_name_2
     -- video_1.avi
     -- video_2.mp4
.
.
.
-- class_name_n
     -- video_1.webm
     -- video_2.mp4
```
The options should work on a custom dataset with a similar file structure.
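To sanity-check that a custom dataset follows this layout before running either option, a quick walk over the class folders can help. This is a minimal sketch; the folder path and the set of extensions are assumptions.

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mkv", ".avi", ".webm"}  # extensions seen in the layout above
data_dir = Path("path_to_your_data_folder")     # one subfolder per class

for class_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
    videos = [v for v in class_dir.iterdir() if v.suffix.lower() in VIDEO_EXTS]
    print(f"{class_dir.name}: {len(videos)} videos")
```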
This option decodes the video frames and encodes each of them as JPEGs before serializing and writing the frames to TFRecord files. Using TFRecord files provides prefetching benefits and improves I/O parallelization, which is especially useful when dealing with video datasets. In other words, using this option, as opposed to option 2, will speed up training. The major downside is that it requires more disk space to store the TFRecord files. For the Kinetics-400 dataset, the TFRecord files took 1.3 TB of disk space for the training (~235k videos) and validation (~19.8k videos) sets, roughly a 10x increase over the original dataset size. (Note that only the frames making up the first 10 seconds of each video are stored in the TFRecord files.)
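For reference, serializing JPEG-encoded frames into a tf.train.Example looks roughly like the sketch below. The feature keys and the frame-extraction step are assumptions made for illustration; see datasets/create_tfrecords.py for the schema this repo actually uses.

```python
import tensorflow as tf

def _bytes_feature(values):
    # Wrap a list of byte strings in a tf.train.Feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def frames_to_example(frames, label):
    """frames: uint8 array of shape [num_frames, H, W, 3]; label: int class id."""
    jpegs = [tf.io.encode_jpeg(f).numpy() for f in frames]  # JPEG-encode every frame
    feature = {
        "frames": _bytes_feature(jpegs),  # hypothetical key, not the repo's schema
        "label": _int64_feature(label),   # hypothetical key
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Writing: one TFRecord file holds many serialized examples.
# with tf.io.TFRecordWriter("tfrecords/rec-train-00000") as writer:
#     writer.write(frames_to_example(frames, label).SerializeToString())
```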
Use the following command to create TFRecord files:
PYTHONPATH=".:$PYTHONPATH" python datasets/create_tfrecords.py --set=<train, val or test> --video_dir=path_to_your_data_folder --label_map=datasets/kinetics400/label_map.json --output_dir=tfrecords/rec --videos_per_record=32
To verify/visualize the contents of the TFRecord files, use the following command:
PYTHONPATH=".:$PYTHONPATH" python datasets/inspect_tfrecord.py --cfg_file=configs/kinetics/X3D_M.yaml --label_map_file=datasets/kinetics400/label_map.json --file_pattern=tfrecords/rec-val-* --eval --num_samples=32
This option is provided for the case where there is not enough disk space to store TFRecord files. It generates a text file in which each line contains a path to a video file and the corresponding class id for that video (e.g. path/to/video.mp4 6).
PYTHONPATH=".:$PYTHONPATH" python datasets/create_label.py --data_dir=path_to_your_data_folder --path_to_label_map=datasets/kinetics400/label_map.json --output_path=datasets/kinetics400/train.txt
To train the model(s) in the paper, run this command:
```bash
python train.py --train_file_pattern=tfrecords/rec-train* --val_file_pattern=tfrecords/rec-val* --use_tfrecords --pretrained_ckpt=models/X3D-XS/ --model_dir=path_to_your_model_folder --config=configs/kinetics/X3D_XS.yaml --num_gpus=1
```
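For multi-GPU runs, the --num_gpus flag presumably selects a distribution strategy. Below is a generic sketch of synchronous data-parallel training in TensorFlow, not this repo's actual training loop; build_x3d is a hypothetical stand-in for the repo's model builder.

```python
import tensorflow as tf

# Synchronous data-parallel training; each GPU gets a replica of the model.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = build_x3d()  # hypothetical stand-in for this repo's model builder
    model.compile(
        optimizer=tf.keras.optimizers.SGD(momentum=0.9),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
# model.fit(train_ds, validation_data=val_ds, epochs=num_epochs)
```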
To evaluate a model, run:
```bash
python eval.py --test_file_pattern=tfrecords/rec-test* --model_folder=models/X3D-XS --cfg=configs/kinetics/X3D_XS.yaml --gpus=1 --tfrecord
```
The table below shows the performance of the models from this implementation on the video classification task using the Kinetics-400 dataset. Training was done on 4 Tesla V100 GPUs. 10-center clip testing was used on both the validation set and the test set (~33.7k videos); a sketch of this evaluation protocol follows the table.
| Model | K400-val Top-1 | K400-val Top-5 | K400-test Top-1 | K400-test Top-5 |
|---|---|---|---|---|
| X3D-XS | 60.3 | 83.2 | 60.2 | 83.1 |
| X3D-S | 66.8 | 87.7 | 65.7 | 87.1 |
| X3D-M | 69.4 | 89.3 | 68.6 | 89.0 |
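The 10-center clip protocol samples 10 clips uniformly along a video's duration, takes a center crop from each, and averages the softmax predictions across clips. A minimal sketch follows; the clip length and crop size are illustrative, and model is assumed to be a callable returning logits.

```python
import tensorflow as tf

def ten_center_clip_predict(model, frames, clip_len=16, num_clips=10, crop=224):
    """frames: float32 tensor [T, H, W, 3] with T >= clip_len and H, W >= crop."""
    t = tf.shape(frames)[0]
    # Clip start indices spaced uniformly over the video's duration.
    starts = tf.cast(
        tf.linspace(0.0, tf.cast(t - clip_len, tf.float32), num_clips), tf.int32)
    h, w = tf.shape(frames)[1], tf.shape(frames)[2]
    top, left = (h - crop) // 2, (w - crop) // 2  # center-crop offsets
    probs = []
    for s in tf.unstack(starts):
        clip = frames[s:s + clip_len, top:top + crop, left:left + crop, :]
        probs.append(tf.nn.softmax(model(clip[tf.newaxis]), axis=-1))
    # Average the per-clip class probabilities, then take the top class.
    return tf.argmax(tf.reduce_mean(tf.concat(probs, axis=0), axis=0))
```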
Training and evaluation are logged on Weights & Biases. Pretrained weights can be found in the `models/` folder.
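Assuming these are standard TensorFlow checkpoints, restoring one looks roughly like the following; the tracked object names are assumptions, and build_x3d is again a hypothetical stand-in.

```python
import tensorflow as tf

model = build_x3d()  # hypothetical stand-in for this repo's model builder
ckpt = tf.train.Checkpoint(model=model)
# expect_partial() silences warnings for objects (e.g. optimizer slots)
# saved in the checkpoint but not needed at evaluation time.
ckpt.restore(tf.train.latest_checkpoint("models/X3D-XS/")).expect_partial()
```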
- Support both reading from TFRecord files and decoding raw video files
- Train models on Kinetics-400 dataset
  - X3D-XS
  - X3D-S
  - X3D-M
- Add multigrid training
- Add localization head to network
- Train models on the Charades dataset
Contributions are welcome.
If you find this work useful, consider citing the original paper:
```bibtex
@inproceedings{feichtenhofer2020x3d,
  title={X3D: Expanding Architectures for Efficient Video Recognition},
  author={Feichtenhofer, Christoph},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={203--213},
  year={2020}
}
```
I would like to thank Kumara Kahatapitiya for sharing the training and validation sets of the Kinetics-400 dataset.