[Core] introduce videoprocessor. #7776
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
src/diffusers/video_processor.py
Outdated
"""Simple video processor.""" | ||
|
||
@staticmethod | ||
def tensor2vid(video: torch.Tensor, processor: "VaeImageProcessor", output_type: str = "np"): |
can you have methods like frames2gif or frames2mpeg in here as well?
@DN6 what do you think?
oops, i meant #7548
I think we can add this functionality to the export functions we have. Generally the pipelines always return an np array, a torch tensor, or a PIL image / list of PIL images.
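For reference, diffusers already ships export helpers that cover these cases; a usage sketch (the dummy `frames` stands in for pipeline output, and the exact accepted types may vary slightly by version):

```python
import numpy as np
from PIL import Image
from diffusers.utils import export_to_gif, export_to_video

# Dummy frames stand in for pipeline output (video pipelines typically
# return a list of PIL images with output_type="pil").
frames = [Image.fromarray(np.zeros((64, 64, 3), dtype=np.uint8)) for _ in range(8)]

export_to_gif(frames, "animation.gif")
export_to_video(frames, "animation.mp4", fps=8)
```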
I don't think so. Here's an example where tensor2vid is being used when returning the output:
diffusers/src/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py
Line 565 in 49b959b
frames = tensor2vid(frames, self.image_processor, output_type=output_type)
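For context, a rough sketch of what a `tensor2vid`-style helper does, reconstructed from the usage above; the (batch, channels, frames, height, width) layout and the `postprocess` call on a VaeImageProcessor-like object are assumptions:

```python
import torch

def tensor2vid(video: torch.Tensor, processor, output_type: str = "np"):
    # `video` is assumed to be (batch, channels, frames, height, width); each
    # batch item is permuted to (frames, channels, height, width) so the image
    # processor can denormalize/convert it frame-by-frame.
    outputs = []
    for batch_idx in range(video.shape[0]):
        batch_vid = video[batch_idx].permute(1, 0, 2, 3)
        outputs.append(processor.postprocess(batch_vid, output_type))
    return outputs
```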
src/diffusers/video_processor.py
Outdated
def preprocess_video(self, video: List[Union[PIL.Image.Image, np.ndarray, torch.Tensor]]) -> torch.FloatTensor:
    """Preprocesses input video(s)."""
    supported_formats = (np.ndarray, torch.Tensor, PIL.Image.Image)
let's first decide what video input formats we accept here. I think:
- list of list of images (or a list of images; in that case, we expand it to a list of list of images)
- list of list of 4d tensors (or a list of 4d tensors; in this case, we expand)
- list of 5d tensors (or a 5d tensor; we expand)
- list of list of 4d numpy arrays (or a list of 4d arrays; we expand)
- list of 5d numpy arrays (or a 5d numpy array; we expand)
cc @DN6 here, anything we would add or remove?
cc @a-r-r-o-w too since you worked on a lot of video pipelines
Okay, so I think all of those are covered. supported_formats here refers to the core base formats, which can be used to create lists, 4D or 5D tensors, etc.
Let me provide concrete lines of code that I think ensure all the formats listed above are supported:

if isinstance(video, list) and isinstance(video[0], list) and isinstance(video[0][0], PIL.Image.Image):
    # In case the video is a list of PIL images, convert to a list of ndarrays.
For formats 2 and 3:
video = torch.cat(video, axis=0) if video[0].ndim == 5 else torch.stack(video, axis=0)
And for formats 4 and 5:
video = np.concatenate(video, axis=0) if video[0].ndim == 5 else np.stack(video, axis=0)
Thanks for looping me in. I think that list[list[image]], list[fchw_tensor], and bfchw_tensor are the most common for a user to end up with, and all the cases you mentioned seem to be handled well; the code makes sense to me.
A small explanation of the expected inputs to the video processor in the docstrings, the order of tensor dims, and some more documentation, like in the image processor, would be helpful for newer users IMO.
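A docstring along these lines could cover that; a hedged sketch, with the dim orders assumed from this thread rather than taken from the final code:

```python
from typing import List, Union

import numpy as np
import PIL.Image
import torch

def preprocess_video(
    self, video: Union[List[List[PIL.Image.Image]], np.ndarray, torch.Tensor]
) -> torch.FloatTensor:
    """Preprocesses input video(s).

    Accepted formats:
        - a list of lists of PIL images (outer list = batch, inner list = frames)
        - a 4D np.ndarray or torch.Tensor of shape (frames, channels, height, width),
          or a list of them
        - a 5D np.ndarray or torch.Tensor of shape (batch, frames, channels, height, width),
          or a list of them

    Returns:
        A 5D torch.FloatTensor of shape (batch, channels, frames, height, width).
    """
```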
Thank you, @a-r-r-o-w! Added a bit of documentation and type annotation to hopefully make it clearer. LMK your thoughts.
a lot of the work here is overlapping with preprocess, though.
Once we make sure the inputs are accepted as one of the below (and expanded):
- list of list of images (or a list of images; in that case, we expand it to a list of list of images)
- list of list of 4d tensors (or a list of 4d tensors; in this case, we expand)
- list of 5d tensors (or a 5d tensor; we expand)
- list of list of 4d numpy arrays (or a list of 4d arrays; we expand)
- list of 5d numpy arrays (or a 5d numpy array; we expand)
can we try:
video = [self.image_processor.preprocess(vid) for vid in videos]
After that, we check if all the tensors in the list have the same shape and throw an error if not.
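A hedged sketch of that flow, written as a standalone function (the name and exact error wording are illustrative):

```python
import torch

def preprocess_videos(image_processor, videos):
    # Run each video through the shared image-processor preprocess, then
    # require a single common shape before stacking into one batch tensor.
    processed = [image_processor.preprocess(vid) for vid in videos]
    shapes = {tuple(v.shape) for v in processed}
    if len(shapes) > 1:
        raise ValueError(f"All videos must share one shape after preprocessing, got {shapes}.")
    return torch.stack(processed, dim=0)
```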
> a lot of the work here is overlapping with preprocess though

Help me understand this a bit better? From what I understand, the current preprocess_video() first checks if we have the inputs in the accepted format and performs the expansion if needed. Then it passes the video off to preprocess() like you suggested:
diffusers/src/diffusers/video_processor.py
Line 125 in c5d22e6
video = torch.stack([self.preprocess(f) for f in video], dim=0)
The only overlap I see is this:
diffusers/src/diffusers/video_processor.py
Line 106 in c5d22e6
video = np.array(video).astype(np.float32) / 255.0
But I see this more as a safeguard rather than an overlap.
i think something like this should work (made-up code, but roughly the logic):

if isinstance(video, supported_formats):
    video = [video]
if isinstance(video[0], PIL.Image.Image) or (isinstance(video[0], (np.ndarray, torch.Tensor)) and video[0].ndim == 4):
    video = [video]
video = torch.stack([self.preprocess(f) for f in video], dim=0)
video = video.permute(0, 2, 1, 3, 4)
How about we first add tests like this: https://github.com/huggingface/diffusers/blob/main/tests/others/test_image_processor.py
Once we have the tests, I can help look into refactoring this function?
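A hedged sketch of what one such test could look like, modeled on test_image_processor.py; the class name, constructor kwargs, and tensor layout are assumptions:

```python
import torch
import unittest

from diffusers.video_processor import VideoProcessor  # module path per this PR


class VideoProcessorTest(unittest.TestCase):
    def test_preprocess_video_pt_5d(self):
        video_processor = VideoProcessor(do_resize=False, do_normalize=True)
        # (batch, frames, channels, height, width), values in [0, 1]
        input_pt = torch.rand(1, 8, 3, 64, 64)
        out = video_processor.preprocess_video(input_pt)
        # Expect a 5D, model-ready float tensor back.
        self.assertEqual(out.ndim, 5)


if __name__ == "__main__":
    unittest.main()
```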
Done. LMK what you think. I had to add a bit of code to deal with the 5D stuff, but the rest has been simplified a lot IMO.
Looks beautiful to me, seeing the multiple copies go away! Just one minor fix with naming for consistency.
The failing test is completely unexpected. Need to look deeper. @yiyixuxu WDYT about the following?
elif isinstance(video, list) and isinstance(video[0], PIL.Image.Image):
if isinstance(video, list) and isinstance(video[0], np.ndarray) and video[0].ndim == 5:
    warnings.warn(
        "Passing `video` as a list of 5d np.ndarray is deprecated."
Technically it's not deprecated since we didn't support it before, right? I think we should just raise an error here.
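What "just raise an error" could look like here; a sketch, with a hypothetical helper name and error wording:

```python
import numpy as np

def _reject_list_of_5d(video):
    # Hypothetical helper: reject the list-of-5D case outright instead of
    # warning about a deprecation that never formally existed.
    if isinstance(video, list) and isinstance(video[0], np.ndarray) and video[0].ndim == 5:
        raise ValueError(
            "Passing `video` as a list of 5D np.ndarrays is not supported. "
            "Pass a single 5D array or a list of 4D arrays instead."
        )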
it was kinda supported before:
diffusers/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py
Line 119 in 0d23645
def preprocess_video(video):
With this code, if you pass a list of 5d tensors, it would work. I think it is because of the way the code was written here:
diffusers/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py
Line 123 in 0d23645
video = [video]
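To spell out why the old helper tolerated that input, here is a reconstructed sketch of the relevant branch (not the exact source):

```python
import numpy as np
import PIL.Image
import torch

def preprocess_video(video):
    # Reconstructed sketch of the old helper's control flow.
    supported_formats = (np.ndarray, torch.Tensor, PIL.Image.Image)
    if isinstance(video, supported_formats):
        video = [video]  # a bare 5D tensor gets wrapped here...
    if isinstance(video[0], torch.Tensor):
        # ...while a list of 5D tensors reaches this branch unchanged, where
        # torch.cat along dim 0 happily merges them into one 5D batch.
        video = torch.cat(video, axis=0) if video[0].ndim == 5 else torch.stack(video, axis=0)
    return video
```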
LGTM 👍🏽
if video and not isinstance(video[0], list):
    video = [video]
Suggested change (remove):
if video and not isinstance(video[0], list):
    video = [video]
But why do we want to do this? I don't think a single-frame video would get represented properly otherwise. We don't have any special treatment for that from preprocess_video either.
why would they need a video-to-video pipeline if it is single-frame, i.e., an image?
I don't know, honestly. I tried to follow the original implementation (i.e., the implementation before the refactor) as faithfully as possible.
From the documentation, though, it seems like it supports multi-frame videos too:
https://huggingface.co/docs/diffusers/en/api/pipelines/animatediff
So, my best guess is that it supports both single-frame and multi-frame videos.
# as a list of images
if video and not isinstance(video[0], list):
    video = [video]
if latents is None:
maybe raise an error in check_inputs when both latents and video are not None
Already there:
diffusers/src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py
Line 598 in 35358a2
if video is not None and latents is not None:
@yiyixuxu could you look into the failing test? It's likely coming from the image processor. But if not, let me know.
@sayakpaul can confirm that the test failure is due to the image processor; I will fix it!
@yiyixuxu I think I have addressed your comments. LMK if this is good to merge.
I left one comment: https://github.com/huggingface/diffusers/pull/7776/files#r1597075529
Okay. I will delete the block that listifies a single image.
feel free to merge once the tests pass :)
What does this PR do?
Introduces a VideoProcessor akin to VaeImageProcessor to encapsulate the logic of dealing with videos.
TODOs