This is the official GitHub repository of the paper:
DELTA: Dense Efficient Long-range 3D Tracking for Any video
Tuan Duc Ngo,
Peiye Zhuang,
Chuang Gan,
Evangelos Kalogerakis,
Sergey Tulyakov,
Hsin-Ying Lee,
Chaoyang Wang,
ICLR 2025
Project Page | arXiv | Paper | BibTeX
DELTA captures dense, long-range 3D trajectories from casual videos in a feed-forward manner.
- Release model weights on Google Drive and demo script
- Release training code & dataset preparation
- Release evaluation code
- Clone DELTA.
git clone --recursive https://github.com/snap-research/DenseTrack3D
cd DenseTrack3D
## if you have already cloned DenseTrack3D:
# git submodule update --init --recursive
- Create the environment.
conda create -n densetrack3d python=3.10 cmake=3.14.0 -y # we recommend using python<=3.10
conda activate densetrack3d
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia -y # use the correct version of cuda for your system
pip install pip==24.0 # downgrade pip to install pytorch_lightning==1.6.0
pip3 install -r requirements.txt
conda install ffmpeg -c conda-forge # to write .mp4 video
pip3 install -U "ray[default]" # for parallel processing
pip3 install viser # for visualizing 3D trajectories
- Install Unidepth.
pip3 install ninja
pip3 install -v -U git+https://github.com/facebookresearch/xformers.git@v0.0.24 # Unidepth requires xformers==0.0.24
- [Optional] Install viser and open3d for 3D visualization.
pip3 install viser
pip3 install open3d
- [Optional] Install dependencies to generate training data with Kubric.
pip3 install bpy==3.4.0
pip3 install pybullet
pip3 install OpenEXR
pip3 install tensorflow "tensorflow-datasets>=4.1.0" tensorflow-graphics # quote the version constraint so the shell does not treat >= as a redirection
cd data/kubric/
pip install -e .
cd ../..
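After installation, a quick sanity check (a minimal sketch, not part of this repo) can confirm that PyTorch was installed with CUDA support before moving on to the demos:
# Environment sanity check (illustrative only, not part of this repo).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))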
The pretrained checkpoints can be downloaded from Google Drive. Run the following commands to download them:
# download the weights
mkdir -p ./checkpoints/
gdown --fuzzy https://drive.google.com/file/d/18d5M3nl3AxbG4ZkT7wssvMXZXbmXrnjz/view?usp=sharing -O ./checkpoints/ # 3D ckpt
gdown --fuzzy https://drive.google.com/file/d/1S_T7DzqBXMtr0voRC_XUGn1VTnPk_7Rm/view?usp=sharing -O ./checkpoints/ # 2D ckpt
- We currently support 4 different tracking modes: Dense 3D Tracking, Sparse 3D Tracking, Dense 2D Tracking, and Sparse 2D Tracking. We include 3 sample videos (car-roundabout and rollerblade from DAVIS, and yellow-duck generated by SORA) in this repo.
- Dense 3D Tracking: This is the main contribution of our work, where the model takes an RGB-D video (the video depth can be obtained with an off-the-shelf depth estimator) and outputs a dense 3D trajectory map. To run the inference code, use one of the following commands:
# run with Unidepth
python3 demo.py --ckpt checkpoints/densetrack3d.pth --video_path demo_data/yellow-duck --output_path results/demo
# or run with DepthCrafter
python3 demo.py --ckpt checkpoints/densetrack3d.pth --video_path demo_data/yellow-duck --output_path results/demo --use_depthcrafter
By default, densely tracking a video of ~100 frames requires ~40GB of GPU memory. To reduce memory consumption, we can use a larger upsample factor (e.g., 8x) and enable fp16 inference, which reduces the requirement to ~20GB of GPU memory:
python3 demo.py --upsample_factor 8 --use_fp16 --ckpt checkpoints/densetrack3d.pth --video_path demo_data/yellow-duck --output_path results/demo
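The demo writes a dense 3D track file (e.g., results/demo/yellow-duck/dense_3d_track.pkl, the same file the visualizers below consume). Below is a hedged sketch for inspecting it; the key names and array shapes mentioned in the comments are assumptions, so check the actual output of demo.py:
# Hedged sketch (not part of the repo): peek inside the dense 3D track pickle.
# We assume it is a dict of arrays; the key names are not specified here.
import pickle

with open("results/demo/yellow-duck/dense_3d_track.pkl", "rb") as f:
    result = pickle.load(f)

# A dense 3D trajectory map is expected to give, for every pixel of the first
# frame, its 3D position in every frame, e.g. an array of shape (T, H, W, 3),
# typically alongside a per-pixel visibility map of shape (T, H, W).
for key, value in result.items():
    print(key, getattr(value, "shape", type(value)))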
- Sparse 3D Tracking: We also support sparse 3D point tracking (similar to SceneTracker and SpaTracker), where users can specify which points to track, or the model will track a sparse grid of points by default.
# run with Unidepth
python3 demo_sparse.py --ckpt checkpoints/densetrack3d.pth --video_path demo_data/yellow-duck --output_path results/demo
# or run with DepthCrafter
python3 demo_sparse.py --ckpt checkpoints/densetrack3d.pth --video_path demo_data/yellow-duck --output_path results/demo --use_depthcrafter
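If you want to supply your own points, the usual convention in CoTracker-style trackers is one (frame, x, y) query per point. The sketch below only illustrates that format; the coordinates are placeholders, and how demo_sparse.py actually ingests custom query points should be checked in the script itself:
# Illustration only: the (t, x, y) query convention common to CoTracker-style
# trackers. Coordinates are placeholders; see demo_sparse.py for how custom
# query points are actually passed in.
import torch

queries = torch.tensor([
    [0.0, 400.0, 350.0],   # start tracking this pixel at frame 0
    [0.0, 600.0, 500.0],
    [10.0, 750.0, 600.0],  # queries may also start at a later frame
])  # shape (N, 3): (frame index, x, y)
print(queries.shape)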
- Dense 2D Tracking: This mode is similar to DOT, where the model takes only an RGB video (no depth input) and outputs a dense 2D coordinate map (UV map).
python3 demo_2d.py --ckpt checkpoints/densetrack2d.pth --video_path demo_data/yellow-duck --output_path results/demo
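To make the UV-map idea concrete, the hedged sketch below warps a later frame back onto the first-frame pixel grid. The file names and the (T, H, W, 2) layout are assumptions, so check the actual output format of demo_2d.py first:
# Hedged sketch (not part of the repo): use a dense 2D coordinate (UV) map,
# assumed to be a (T, H, W, 2) array of per-pixel (x, y) positions in each
# target frame, to warp frame t back onto the first-frame pixel grid.
import cv2
import numpy as np

uv = np.load("uv_map.npy")              # hypothetical file, shape (T, H, W, 2)
frame_t = cv2.imread("frame_050.png")   # hypothetical target frame (t = 50)

map_x = uv[50, ..., 0].astype(np.float32)  # x position of each first-frame pixel in frame 50
map_y = uv[50, ..., 1].astype(np.float32)  # y position of each first-frame pixel in frame 50

# Pull colors from frame 50 back onto the first-frame grid.
warped = cv2.remap(frame_t, map_x, map_y, interpolation=cv2.INTER_LINEAR)
cv2.imwrite("frame_050_warped_to_frame_000.png", warped)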
- Sparse 2D Tracking: This mode is similar to CoTracker, where the model takes only an RGB video as input; users can specify which points to track, or the model will track a sparse grid of points by default. The output is a set of 2D trajectories.
python3 demo_2d_sparse.py --ckpt checkpoints/densetrack2d.pth --video_path demo_data/yellow-duck --output_path results/demo
- [Optional] Visualize the dense 3D tracks with viser:
python3 visualizer/vis_densetrack3d.py --filepath results/demo/yellow-duck/dense_3d_track.pkl
- [Optional] Visualize the dense 3D tracks with open3d (GUI required). To highlight the trajectories of the foreground object, we provide a binary foreground mask for the first frame of the video (the starting frame for dense tracking), which can be obtained with SAM.
# First run with mode=choose_viewpoint: a 3D GUI will pop up and you can select the viewpoint to capture. Press "S" to save the viewpoint and exit.
python3 visualizer/vis_open3d.py --filepath results/demo/yellow-duck/dense_3d_track.pkl --fg_mask_path demo_data/yellow-duck/yellow-duck_mask.png --video_name yellow-duck --mode choose_viewpoint
# Then run with mode=capture to render a 2D video of the dense tracking result
python3 visualizer/vis_open3d.py --filepath results/demo/yellow-duck/dense_3d_track.pkl --fg_mask_path demo_data/yellow-duck/yellow-duck_mask.png --video_name yellow-duck --mode capture
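For videos without a provided mask, one way to create the first-frame foreground mask is Segment Anything (SAM). The sketch below is a hedged example: the SAM checkpoint name, frame path, and click coordinates are placeholders, not part of this repo.
# Hedged sketch (not part of the repo): create a binary first-frame foreground
# mask with Segment Anything from a single positive click.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Hypothetical first-frame path; replace with the actual frame of your video.
image = cv2.cvtColor(cv2.imread("demo_data/yellow-duck/00000.jpg"), cv2.COLOR_BGR2RGB)

# Placeholder SAM checkpoint; download it separately from the SAM repository.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

# One positive click roughly on the foreground object (placeholder coordinates).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[480, 270]]),
    point_labels=np.array([1]),
    multimask_output=False,
)
cv2.imwrite("demo_data/yellow-duck/yellow-duck_mask.png", masks[0].astype(np.uint8) * 255)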
Please follow the instructions here to prepare the training & evaluation data.
- Pretrain dense 2D tracking model
bash scripts/train/pretrain_2d.sh
- Train dense 3D tracking model
bash scripts/train/train.sh
- Evaluate sparse 3D tracking on the TAPVid3D Benchmark
# Note: replace TAPVID3D_DIR with the real path to the TAPVid3D dataset
python3 scripts/eval/eval_3d.py
- Evaluate dense 2D tracking on the CVO Benchmark
# Note: replace CVO_DIR with the real path to the CVO dataset
python3 scripts/eval/eval_flow2d.py
- Evaluate sparse 2D tracking on the TAPVid2D Benchmark
# Note: replace TAPVID2D_DIR with the real path to the TAPVid2D dataset
python3 scripts/eval/eval_2d.py
If you find our repository useful, please consider giving it a star ⭐ and citing our paper in your work:
@article{ngo2024delta,
author = {Ngo, Tuan Duc and Zhuang, Peiye and Gan, Chuang and Kalogerakis, Evangelos and Tulyakov, Sergey and Lee, Hsin-Ying and Wang, Chaoyang},
title = {DELTA: Dense Efficient Long-range 3D Tracking for Any video},
journal = {arXiv preprint arXiv:2410.24211},
year = {2024}
}
Our code is based on CoTracker, SceneTracker, and LocoTrack; the training data generation is based on Kubric; and our visualization code is based on Viser and Open3D. We thank the authors for their excellent work!