gongzix/NeuroClips

NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction

NeuroClips is a novel framework for fMRI-to-video decoding (NeurIPS 2024 Oral). If you like our project, please give us a star ⭐.

🛠️ Method

[Figure: overview of the NeuroClips model]

📣 News

  • Dec. 3, 2024. Full code released.
  • Nov. 30, 2024. Preprocessing code and dataset released.
  • Sep. 26, 2024. Accepted by NeurIPS 2024 for Oral Presentation.
  • May 24, 2024. Project released.

Data Preprocessing

We use the public cc2017 (Wen) dataset (see the dataset link under Acknowledgement). You can download it and follow the official preprocessing pipeline, applying it to the fMRI data only. Use only movie_fmri_data_processing.m and movie_fmri_reproducibility.m. Note that we select voxels with a looser threshold (Bonferroni correction, P < 0.05) than the original one (Bonferroni correction, P < 0.01), so more voxels are retained than before.
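The effect of loosening the threshold can be sketched in a few lines of Python. This is only an illustration of Bonferroni-corrected selection on synthetic p-values; the official preprocessing is the MATLAB code named above, and the function name `bonferroni_select` and the random data are assumptions made for this example.

```python
import numpy as np

def bonferroni_select(p_values, alpha):
    """Return a boolean mask of voxels passing a Bonferroni-corrected test."""
    p_values = np.asarray(p_values, dtype=float)
    # Bonferroni: divide the family-wise alpha by the number of tests (voxels).
    threshold = alpha / p_values.size
    return p_values < threshold

# Synthetic p-values for illustration only (not real cc2017 statistics).
rng = np.random.default_rng(0)
p_values = rng.uniform(0.0, 1e-4, size=1000)

mask_strict = bonferroni_select(p_values, alpha=0.01)  # original threshold
mask_loose = bonferroni_select(p_values, alpha=0.05)   # threshold used here

# Every voxel that passes P < 0.01 also passes P < 0.05, so the looser
# threshold can only keep the same voxels or more.
print(int(mask_strict.sum()), int(mask_loose.sum()))
```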

We also offer our pre-processed fMRI data and frames sampled from videos for training in NeuroClips, and you can directly download them from Huggingface NeuroClips.

You can use python src/caption.py to generate the captions.

Installation

We recommend keeping the virtual environment used for NeuroClips training, keyframe inference, and blurry-video inference separate from the one used for the pre-trained T2V diffusion model, to avoid conflicts between package versions.

For NeuroClips:

. src/setup.sh

For the pre-trained AnimateDiff model, set up a separate environment as follows:

conda create -n animatediff python==3.10
conda activate animatediff
cd AnimateDiff
pip install -r requirements.txt

Train Semantic Reconstructor

We suggest training the backbone first (first command) and then the prior (second command, with --use_prior) to obtain a better Semantic Reconstructor.

conda activate neuroclips
python src/train_SR.py --subj 1 --batch_size 240 --num_epochs 30 --mixup_pct 1.0 --max_lr 1e-4 --use_text
python src/train_SR.py --subj 1 --batch_size 64 --num_epochs 150 --mixup_pct 0.0 --max_lr 3e-4 --use_prior --use_text

Train Perception Reconstructor

python src/train_PR.py --subj 1 --batch_size 40 --mixup_pct 0.0 --num_epochs 80

Reconstruct Keyframe

python src/recon_keyframe.py --subj 1

After the keyframes are generated, you can caption them with BLIP-2 by running python src/caption.py.

Reconstruct Blurry Video

python src/recon_blurry.py --subj 1

Reconstruct Videos

After preparing all the inputs, you can reconstruct the video. Any pre-trained T2V or V2V model can be used. Here we use the pre-trained T2V model AnimateDiff, specifically SparseCtrl for first-frame guidance.

conda activate animatediff
cd AnimateDiff
python -m scripts.neuroclips --config configs/NeuroClips/control.yaml

The pre-trained weights you need to prepare are listed here.
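The stages above run in a fixed order across two conda environments. The following sketch collects the commands from this README into one list and prints them as a dry run. It assumes the environments are named neuroclips and animatediff as above, that the AnimateDiff folder sits in the repository root, and that `conda run` is available; the helper names `build_commands` and `run_pipeline` are made up for this example and are not part of the repository.

```python
import shlex
import subprocess

SUBJ = 1  # subject index, as in the commands above

# (conda environment, working directory, command) for each stage, in order.
STAGES = [
    ("neuroclips", ".", f"python src/train_SR.py --subj {SUBJ} --batch_size 240 "
                        "--num_epochs 30 --mixup_pct 1.0 --max_lr 1e-4 --use_text"),
    ("neuroclips", ".", f"python src/train_SR.py --subj {SUBJ} --batch_size 64 "
                        "--num_epochs 150 --mixup_pct 0.0 --max_lr 3e-4 --use_prior --use_text"),
    ("neuroclips", ".", f"python src/train_PR.py --subj {SUBJ} --batch_size 40 "
                        "--mixup_pct 0.0 --num_epochs 80"),
    ("neuroclips", ".", f"python src/recon_keyframe.py --subj {SUBJ}"),
    ("neuroclips", ".", "python src/caption.py"),
    ("neuroclips", ".", f"python src/recon_blurry.py --subj {SUBJ}"),
    ("animatediff", "AnimateDiff",
     "python -m scripts.neuroclips --config configs/NeuroClips/control.yaml"),
]

def build_commands(stages=STAGES):
    """Return (working_dir, argv) pairs that wrap each stage in `conda run`."""
    return [(cwd, ["conda", "run", "-n", env] + shlex.split(cmd))
            for env, cwd, cmd in stages]

def run_pipeline(dry_run=True):
    """Print every stage; execute it as well when dry_run is False."""
    for cwd, argv in build_commands():
        print(f"[{cwd}] {shlex.join(argv)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing stage.
            subprocess.run(argv, cwd=cwd, check=True)

if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Keeping the default as a dry run makes it easy to inspect the order and flags before committing GPU time; calling `run_pipeline(dry_run=False)` would execute the stages one after another.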

🖼️ Reconstruction Demos

Human Behavior

[Video grid: ground truth (GT) next to our reconstruction (Ours)]

Animals

[Video grid: ground truth (GT) next to our reconstruction (Ours)]

Traffic

[Video grid: ground truth (GT) next to our reconstruction (Ours)]

Natural Scene

[Video grid: ground truth (GT) next to our reconstruction (Ours)]

Multi-fMRI Fusion

With the help of NeuroClips' Semantic Reconstructor (SR), we explored the generation of longer videos for the first time. Because long-video generation is still an immature field, we chose a straightforward fusion strategy that requires no additional GPU training. During inference, we compare the semantics of the two reconstructed keyframes from two neighboring fMRI samples (here we simply check whether they show the same class of object, e.g., both are jellyfish). If they are semantically similar, we replace the keyframe of the latter fMRI sample with the tail frame of the former sample's reconstructed video, and use it as the first frame when generating the latter's video.
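The fusion rule described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: frames are represented by strings, the class label of each keyframe is assumed to be given, and `Segment`, `fuse_and_generate`, and `toy_generate` are hypothetical names standing in for the real keyframes and the first-frame-guided video generator.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = str  # stand-in for an image; a real pipeline would use arrays or tensors

@dataclass
class Segment:
    keyframe: Frame  # keyframe reconstructed from one fMRI sample
    label: str       # object class of that keyframe (assumed to be given)
    caption: str     # text prompt for the video generator

def fuse_and_generate(
    segments: Sequence[Segment],
    generate_video: Callable[[Frame, str], List[Frame]],
) -> List[List[Frame]]:
    """Generate one clip per segment, chaining clips whose keyframes match."""
    videos: List[List[Frame]] = []
    prev_label = None
    for seg in segments:
        first_frame = seg.keyframe
        # Same class as the previous keyframe: start from the previous tail frame.
        if videos and seg.label == prev_label:
            first_frame = videos[-1][-1]
        videos.append(generate_video(first_frame, seg.caption))
        prev_label = seg.label
    return videos

def toy_generate(first_frame: Frame, caption: str, n_frames: int = 3) -> List[Frame]:
    """Hypothetical generator: keeps the first frame, then emits dummy frames."""
    return [first_frame] + [f"{caption}#{i}" for i in range(1, n_frames)]

segments = [
    Segment("kfA", "jellyfish", "jelly_a"),
    Segment("kfB", "jellyfish", "jelly_b"),
    Segment("kfC", "moon", "moon"),
]
videos = fuse_and_generate(segments, toy_generate)
print(videos[1][0])  # jelly_a#2: tail frame of the first clip replaces kfB
print(videos[2][0])  # kfC: the class changed, so the original keyframe is kept
```

Comparing the labels of the original reconstructed keyframes, rather than of the generated tail frames, follows the description above and keeps the rule free of any extra model or training.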

[Figure: multi-fMRI fusion]

Failure Cases

Overall, the failure cases fall into two categories: cases where the semantics are not accurate enough, and cases where a scene transition affects the generated results.

Pixel Control & Semantic Deficit

In the cc2017 dataset, the video clips in the testing movie differ from those in the training movie, and some object categories never appear in the training set. However, thanks to NeuroClips' Perception Reconstructor, we can still reconstruct the low-level visual content of the video.

[Video grid: ground truth (GT) next to our reconstruction (Ours)]

Scene Transitions

Due to the low temporal resolution of fMRI (i.e., 2 s), a single fMRI segment may span two video scenes. This leads to semantic confusion in the video reconstruction, or even to a fusion of semantics and perception, as shown in the following image: a jellyfish transitioning to the moon is ultimately reconstructed as a jellyfish on a black background.
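To see which fMRI segments are at risk, one can check where scene cuts fall relative to the 2 s sampling grid. The sketch below assumes that segments are aligned to the start of the video and ignores the hemodynamic delay; the cut times are invented for illustration, and `segments_with_cuts` is a hypothetical helper, not part of the repository.

```python
import math

def segments_with_cuts(cut_times, duration, tr=2.0):
    """Return indices of fMRI segments that contain a scene cut.

    cut_times: scene-cut timestamps in seconds from the start of the video.
    duration:  video length in seconds.
    tr:        temporal resolution of fMRI in seconds (2 s in this dataset).
    """
    n_segments = math.ceil(duration / tr)
    flagged = set()
    for t in cut_times:
        idx = int(t // tr)
        # A cut exactly on a segment boundary does not split any segment.
        if 0 <= idx < n_segments and t % tr != 0:
            flagged.add(idx)
    return sorted(flagged)

# Hypothetical cuts at 3.1 s, 4.0 s and 9.5 s in a 12 s video.
print(segments_with_cuts([3.1, 4.0, 9.5], duration=12.0))  # [1, 4]
```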

[Video grid: ground truth (GT) next to our reconstruction (Ours)]

BibTeX

@article{gong2024neuroclips,
  title={NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction},
  author={Gong, Zixuan and Bao, Guangyin and Zhang, Qi and Wan, Zhongwei and Miao, Duoqian and Wang, Shoujin and Zhu, Lei and Wang, Changwei and Xu, Rongtao and Hu, Liang and others},
  journal={arXiv preprint arXiv:2410.19452},
  year={2024}
}

Acknowledgement

We sincerely thank the following authors. NeuroClips builds on their excellent open-source projects and impressive ideas.

T2V diffusion: https://github.com/guoyww/AnimateDiff

Excellent Backbone: https://github.com/MedARC-AI/MindEyeV2

Temporal Design: https://arxiv.org/abs/2304.08818

Keyframe Captioning: https://github.com/salesforce/LAVIS/tree/main/projects/blip2

Dataset and preprocessing code: https://purr.purdue.edu/publications/2809
