
DELTA: Dense Efficient Long-range 3D Tracking for Any video

This is the official GitHub repository of the paper:

DELTA: Dense Efficient Long-range 3D Tracking for Any video
Tuan Duc Ngo, Peiye Zhuang, Chuang Gan, Evangelos Kalogerakis, Sergey Tulyakov, Hsin-Ying Lee, Chaoyang Wang,
ICLR 2025

DELTA captures dense, long-range, 3D trajectories from casual videos in a feed-forward manner.

TODO

  • Release model weights on Google Drive and demo script
  • Release training code & dataset preparation
  • Release evaluation code

Getting Started

Installation

  1. Clone DELTA.
git clone --recursive https://github.com/snap-research/DenseTrack3D
cd DenseTrack3D
## if you have already cloned DenseTrack3D:
# git submodule update --init --recursive
  2. Create the environment.
conda create -n densetrack3d python=3.10 cmake=3.14.0 -y # we recommend using python<=3.10
conda activate densetrack3d 
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia -y  # use the correct version of cuda for your system

pip install pip==24.0 # downgrade pip to install pytorch_lightning==1.6.0
pip3 install -r requirements.txt
conda install ffmpeg -c conda-forge # to write .mp4 video

pip3 install -U "ray[default]" # for parallel processing
pip3 install viser # to visualize 3D trajectories
  3. Install Unidepth.
pip3 install ninja
pip3 install -v -U git+https://github.com/facebookresearch/xformers.git@v0.0.24 # Unidepth requires xformers==0.0.24
  4. [Optional] Install viser and open3d for 3D visualization.
pip3 install viser
pip3 install open3d
  5. [Optional] Install dependencies to generate training data with Kubric.
pip3 install bpy==3.4.0
pip3 install pybullet
pip3 install OpenEXR
pip3 install tensorflow "tensorflow-datasets>=4.1.0" tensorflow-graphics

cd data/kubric/
pip install -e .
cd ../..
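
As a quick sanity check of the environment (plain PyTorch, nothing repo-specific), a snippet like the following confirms that torch imports and that CUDA is visible:

# quick environment sanity check (hypothetical helper, not part of the repo)
import torch

print("torch version:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))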

Download Checkpoints

The pretrained checkpoints can be downloaded from Google Drive.

Run the following commands to download:

# download the weights
mkdir -p ./checkpoints/
gdown --fuzzy https://drive.google.com/file/d/18d5M3nl3AxbG4ZkT7wssvMXZXbmXrnjz/view?usp=sharing -O ./checkpoints/ # 3D ckpt
gdown --fuzzy https://drive.google.com/file/d/1S_T7DzqBXMtr0voRC_XUGn1VTnPk_7Rm/view?usp=sharing -O ./checkpoints/ # 2D ckpt
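
The demo commands below expect checkpoints/densetrack3d.pth and checkpoints/densetrack2d.pth. As a quick check that the downloads deserialize, here is a minimal sketch (it assumes standard PyTorch checkpoint files; adjust the paths if your files are named differently):

# hypothetical check, not part of the repo; filenames follow the demo commands below
import torch

for path in ["checkpoints/densetrack3d.pth", "checkpoints/densetrack2d.pth"]:
    ckpt = torch.load(path, map_location="cpu")
    size = len(ckpt) if hasattr(ckpt, "__len__") else "n/a"
    print(path, type(ckpt).__name__, size)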

Inference

  1. We currently support 4 tracking modes: Dense 3D Tracking, Sparse 3D Tracking, Dense 2D Tracking, and Sparse 2D Tracking. We include 3 sample videos in this repo: car-roundabout and rollerblade from DAVIS, and yellow-duck generated by Sora.
  • Dense 3D Tracking: This is the main contribution of our work: the model takes an RGB-D video (the video depth can be obtained from an off-the-shelf depth estimator) and outputs a dense 3D trajectory map, which the demo saves as a .pkl file (see the inspection sketch after this list). To run the inference code, use the following command:

    python3 demo.py --ckpt checkpoints/densetrack3d.pth --video_path demo_data/yellow-duck --output_path results/demo # run with Unidepth
    
    # or
    python3 demo.py --ckpt checkpoints/densetrack3d.pth --video_path demo_data/yellow-duck --output_path results/demo --use_depthcrafter # run with DepthCrafter

    By default, densely tracking a video of ~100 frames requires ~40GB of GPU memory. To reduce memory consumption, we can use a larger upsample factor (e.g., 8x) and enable fp16 inference, which reduces the requirement to ~20GB of GPU memory:

    python3 demo.py --upsample_factor 8 --use_fp16 --ckpt checkpoints/densetrack3d.pth --video_path demo_data/yellow-duck --output_path results/demo
  • Sparse 3D Tracking: We also support sparse 3D point tracking (similar to SceneTracker and SpaTracker), where users can specify which points to track, or the model will track a sparse grid of points by default.

    python3 demo_sparse.py --ckpt checkpoints/densetrack3d.pth --video_path demo_data/yellow-duck --output_path results/demo # run with Unidepth
    
    # or
    python3 demo_sparse.py --ckpt checkpoints/densetrack3d.pth --video_path demo_data/yellow-duck --output_path results/demo --use_depthcrafter # run with DepthCrafter
  • Dense 2D Tracking: This mode is similar to DOT: the model takes only an RGB video (no depth input) and outputs a dense 2D coordinate map (UV map).

    python3 demo_2d.py --ckpt checkpoints/densetrack2d.pth --video_path demo_data/yellow-duck --output_path results/demo
  • Sparse 2D Tracking: This mode is similar to CoTracker: the model takes only an RGB video as input, and users can specify which points to track, or the model will track a sparse grid of points by default. The output is a set of 2D trajectories.

    python3 demo_2d_sparse.py --ckpt checkpoints/densetrack2d.pth --video_path demo_data/yellow-duck --output_path results/demo
  2. [Optional] Visualize the dense 3D tracks with viser:
python3 visualizer/vis_densetrack3d.py --filepath results/demo/yellow-duck/dense_3d_track.pkl
  3. [Optional] Visualize the dense 3D tracks with open3d (GUI required). To highlight the trajectories of the foreground object, we provide a binary foreground mask for the first frame of the video (the starting frame for dense tracking), which can be obtained with SAM.
# First run with mode=choose_viewpoint: a 3D GUI will pop up, and you can select the proper viewpoint to capture. Press "S" to save the viewpoint and exit.
python3 visualizer/vis_open3d.py --filepath results/yellow-duck/dense_3d_track.pkl --fg_mask_path demo_data/yellow-duck/yellow-duck_mask.png --video_name yellow-duck --mode choose_viewpoint

# Then run with mode=capture to render a 2D video of the dense tracking
python3 visualizer/vis_open3d.py --filepath results/yellow-duck/dense_3d_track.pkl --fg_mask_path demo_data/yellow-duck/yellow-duck_mask.png --video_name yellow-duck --mode capture
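
The layout of the saved dense_3d_track.pkl is not documented in this README, so the following is only a generic sketch for inspecting the demo output, assuming it is a standard Python pickle as the .pkl extension and the visualizer scripts suggest:

# generic inspection sketch; the actual keys and array shapes depend on the demo's save format
import pickle

with open("results/demo/yellow-duck/dense_3d_track.pkl", "rb") as f:
    data = pickle.load(f)

if isinstance(data, dict):
    for key, value in data.items():
        shape = getattr(value, "shape", None)
        print(key, type(value).__name__, shape if shape is not None else "")
else:
    print(type(data).__name__)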

Prepare training & evaluation data

Please follow the instructions here to prepare the training & evaluation data.

Training

  1. Pretrain dense 2D tracking model
bash scripts/train/pretrain_2d.sh
  2. Train dense 3D tracking model
bash scripts/train/train.sh

Evaluation

  1. Evaluate sparse 3D tracking on the TAPVid3D Benchmark
# Note: replace TAPVID3D_DIR with the real path to tapvid3d dataset
python3 scripts/eval/eval_3d.py
  2. Evaluate dense 2D tracking on the CVO Benchmark
# Note: replace CVO_DIR with the real path to CVO dataset
python3 scripts/eval/eval_flow2d.py
  3. Evaluate sparse 2D tracking on the TAPVid2D Benchmark
# Note: replace TAPVID2D_DIR with the real path to tapvid2d dataset
python3 scripts/eval/eval_2d.py

Citing DELTA

If you find our repository useful, please consider giving it a star ⭐ and citing our paper in your work:

@article{ngo2024delta,
  author    = {Ngo, Tuan Duc and Zhuang, Peiye and Gan, Chuang and Kalogerakis, Evangelos and Tulyakov, Sergey and Lee, Hsin-Ying and Wang, Chaoyang},
  title     = {DELTA: Dense Efficient Long-range 3D Tracking for Any video},
  journal   = {arXiv preprint arXiv:2410.24211},
  year      = {2024}
}

Acknowledgements

Our code is based on CoTracker, SceneTracker, and LocoTrack, the training data generation is based on Kubric, and our visualization code is based on Viser and Open3D. We thank the authors for their excellent work!
