
[Core] introduce videoprocessor. #7776

Merged: 46 commits merged from video-processor into main on May 10, 2024

Conversation

sayakpaul (Member) commented on Apr 25, 2024:

What does this PR do?

Introduces a VideoProcessor akin to VaeImageProcessor to encapsulate the logic of dealing with videos.
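For illustration, a minimal usage sketch of what such a processor could look like (a hypothetical sketch: the constructor arguments, method names, and tensor layouts shown are assumptions and may differ from the final API):

```python
import torch

from diffusers.video_processor import VideoProcessor  # module introduced by this PR

video_processor = VideoProcessor(vae_scale_factor=8)  # assumed to mirror VaeImageProcessor's args

# A batch of two 16-frame videos as a 5D tensor: (batch, frames, channels, height, width).
frames = torch.rand(2, 16, 3, 256, 256)

# Preprocess into the layout the pipelines consume, assumed (batch, channels, frames, height, width).
video = video_processor.preprocess_video(frames)

# Postprocess model output back into a user-facing format ("np", "pt", or "pil").
output = video_processor.postprocess_video(video, output_type="np")
```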

TODOs

  • Add tests
  • Docs

@sayakpaul sayakpaul requested review from DN6 and yiyixuxu April 25, 2024 08:59
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

"""Simple video processor."""

@staticmethod
def tensor2vid(video: torch.Tensor, processor: "VaeImageProcessor", output_type: str = "np"):
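The method body is collapsed in this diff view. Roughly, the idea (a sketch inferred from how tensor2vid is used elsewhere in the codebase, not the verbatim implementation) is to delegate each video in the batch to the image processor:

```python
# Sketch: split a (batch, channels, frames, height, width) tensor into per-video
# frame stacks and let the existing image processor do the actual conversion.
batch_size, channels, num_frames, height, width = video.shape
outputs = []
for batch_idx in range(batch_size):
    batch_vid = video[batch_idx].permute(1, 0, 2, 3)  # -> (frames, channels, height, width)
    outputs.append(processor.postprocess(batch_vid, output_type))
return outputs
```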
Contributor:

Can you have the methods like frames2gif or frames2mpeg in here as well?

sayakpaul (Member Author):

@DN6 what do you think?

Contributor:

Oops, I meant #7548.

Collaborator:

I think we can add this functionality to the export functions we have. Generally the pipelines always return an np array, a torch tensor, or a PIL image / list of PIL images.

sayakpaul (Member Author):

I don't think so. Here's an example where tensor2vid is being used when returning the output:

```python
frames = tensor2vid(frames, self.image_processor, output_type=output_type)
```

@sayakpaul sayakpaul requested a review from yiyixuxu April 27, 2024 04:21
@sayakpaul sayakpaul requested a review from DN6 April 29, 2024 11:23

```python
def preprocess_video(self, video: List[Union[PIL.Image.Image, np.ndarray, torch.Tensor]]) -> torch.FloatTensor:
    """Preprocesses input video(s)."""
    supported_formats = (np.ndarray, torch.Tensor, PIL.Image.Image)
```
yiyixuxu (Collaborator) commented on Apr 29, 2024:

Let's first decide what video input formats we accept here. I think:

  1. list of list of images (or a list of images; in that case, we expand it to a list of list of images)
  2. list of list of 4D tensors (or a list of 4D tensors; in this case, we expand)
  3. list of 5D tensors (or a 5D tensor; we expand)
  4. list of list of 4D numpy arrays (or a list of 4D arrays; we expand)
  5. list of 5D numpy arrays (or a 5D numpy array; we expand)

cc @DN6 here, anything we would add or remove?
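For concreteness, the five accepted forms might look like this (illustrative shapes only; torch inputs are assumed channels-first and numpy inputs channels-last, mirroring VaeImageProcessor's conventions):

```python
import numpy as np
import PIL.Image
import torch

img = PIL.Image.new("RGB", (64, 64))

video = [[img, img], [img, img]]          # 1. list of list of PIL images
video = [torch.rand(16, 3, 64, 64)]       # 2. list of 4D (frames, c, h, w) tensors
video = torch.rand(2, 16, 3, 64, 64)      # 3. a 5D (batch, frames, c, h, w) tensor, or a list of them
video = [np.random.rand(16, 64, 64, 3)]   # 4. list of 4D (frames, h, w, c) numpy arrays
video = np.random.rand(2, 16, 64, 64, 3)  # 5. a 5D (batch, frames, h, w, c) numpy array, or a list of them
```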

Collaborator:

cc @a-r-r-o-w too since you worked on a lot of video pipelines

sayakpaul (Member Author):

Okay, so I think all of those are covered. supported_formats here refers to the core base formats, which can be used to build the lists, 4D or 5D tensors, etc.

Let me provide the concrete lines of code that I think ensure all the formats listed above are supported:

  • For 1.: `if isinstance(video, list) and isinstance(video[0], list) and isinstance(video[0][0], PIL.Image.Image):` together with the conversion step commented `# In case the video is a list of PIL images, convert to a list of ndarrays.`
  • For 2. and 3.: `video = torch.cat(video, axis=0) if video[0].ndim == 5 else torch.stack(video, axis=0)`
  • For 4. and 5.: `video = np.concatenate(video, axis=0) if video[0].ndim == 5 else np.stack(video, axis=0)`
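Put together, the dispatch roughly amounts to the following (an illustrative consolidation of the quoted lines, not verbatim source; it assumes `video` has already been expanded to a list as described in the numbered formats):

```python
import numpy as np
import PIL.Image
import torch

# 1. A list of list of PIL images becomes a list of (frames, h, w, c) float arrays in [0, 1].
if isinstance(video, list) and isinstance(video[0], list) and isinstance(video[0][0], PIL.Image.Image):
    video = [np.array(frames).astype(np.float32) / 255.0 for frames in video]

if isinstance(video[0], torch.Tensor):
    # 2. and 3.: stack 4D tensors into a batch, or concatenate 5D tensors along the batch dim.
    video = torch.cat(video, dim=0) if video[0].ndim == 5 else torch.stack(video, dim=0)
elif isinstance(video[0], np.ndarray):
    # 4. and 5.: the same logic for numpy arrays.
    video = np.concatenate(video, axis=0) if video[0].ndim == 5 else np.stack(video, axis=0)
```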

a-r-r-o-w (Member):

Thanks for looping me in. I think that list[list[image]], list[fchw_tensor], and bfchw_tensor are the most common for a user to end up with, and all the cases you mentioned seem to be handled well; the code makes sense to me.

A small explanation of the expected inputs to the video processor in the docstrings, the order of tensor dims, and some more documentation, like in the image processor, would be helpful for newer users IMO.

sayakpaul (Member Author):

Thank you, @a-r-r-o-w! Added a bit of documentation and type annotation to hopefully make it clearer. LMK your thoughts.

yiyixuxu (Collaborator):

A lot of the work here is overlapping with preprocess, though.

Once we make sure the inputs are accepted as one of the below (and expanded):

  • list of list of images (or a list of images; in that case, we expand it to a list of list of images)
  • list of list of 4D tensors (or a list of 4D tensors; in this case, we expand)
  • list of 5D tensors (or a 5D tensor; we expand)
  • list of list of 4D numpy arrays (or a list of 4D arrays; we expand)
  • list of 5D numpy arrays (or a 5D numpy array; we expand)

can we try:

```python
video = [self.image_processor.preprocess(vid) for vid in videos]
```

After that, we check that all the tensors in the list have the same shape and throw an error if not (see the sketch after this comment).
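The shape check at the end could be as simple as this (a hypothetical sketch of the suggestion, not actual source):

```python
import torch

# After per-video preprocessing, every entry must agree in (frames, channels, height, width).
shapes = {tuple(v.shape) for v in video}
if len(shapes) > 1:
    raise ValueError(f"All videos must have the same shape, got: {shapes}.")
video = torch.stack(video, dim=0)
```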

sayakpaul (Member Author):

> a lot of the work here is overlapping with preprocess though

Help me understand this a bit better? From what I understand, the current preprocess_video() first checks whether the inputs are in an accepted format and performs the expansion if needed. Then it passes the video off to preprocess(), like you suggested:

```python
video = torch.stack([self.preprocess(f) for f in video], dim=0)
```

sayakpaul (Member Author):

The only overlap I see is this:

```python
video = np.array(video).astype(np.float32) / 255.0
```

But I see this more as a safeguard rather than an overlap.

yiyixuxu (Collaborator):

I think something like this should work (made-up code, but roughly the logic):

```python
if isinstance(video, supported_formats):
    video = [video]
if isinstance(video[0], PIL.Image.Image) or (
    isinstance(video[0], (np.ndarray, torch.Tensor)) and video[0].ndim == 4
):
    video = [video]

video = torch.stack([self.preprocess(f) for f in video], dim=0)
video = video.permute(0, 2, 1, 3, 4)
```

How about we first add tests like https://github.com/huggingface/diffusers/blob/main/tests/others/test_image_processor.py? Once we have the tests, I can help look into refactoring this function.
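A test along those lines might look like this (a hypothetical sketch modeled on the image-processor tests; the constructor arguments and expected output layout are assumptions):

```python
import unittest

import torch

from diffusers.video_processor import VideoProcessor  # module introduced by this PR


class VideoProcessorTest(unittest.TestCase):
    def test_preprocess_5d_tensor(self):
        video_processor = VideoProcessor(do_resize=False, do_normalize=True)
        video = torch.rand(1, 8, 3, 32, 32)  # (batch, frames, channels, height, width)
        out = video_processor.preprocess_video(video)
        # Assumed channels-first output layout: (batch, channels, frames, height, width).
        self.assertEqual(tuple(out.shape), (1, 3, 8, 32, 32))
```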

sayakpaul (Member Author) commented on May 3, 2024:

Done. LMK what you think. I had to add a bit of code to deal with the 5D stuff, but the rest has been simplified a lot IMO. @yiyixuxu

a-r-r-o-w (Member) left a comment:

Looks beautiful to me, seeing the multiple copies go away! Just one minor naming fix for consistency.

sayakpaul (Member Author):

The failing test is completely unexpected. Need to look deeper. @yiyixuxu WDYT about the preprocess_video refactor btw?

```python
elif isinstance(video, list) and isinstance(video[0], PIL.Image.Image):
if isinstance(video, list) and isinstance(video[0], np.ndarray) and video[0].ndim == 5:
    warnings.warn(
        "Passing `video` as a list of 5d np.ndarray is deprecated."
```
Collaborator:

Technically it's not deprecated, since we didn't support it before, right? I think we should just raise an error here.

Collaborator:

It was kinda supported before.

In the previous code, if you pass a list of 5D tensors, it would work. I think it is because of the way the code was written: the first thing it does is check whether the input is a tensor and, if so, add it to a list. So in order to support a 5D tensor, it has to support a list of 5D tensors as well. It's the same situation with the image processor.
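In other words, roughly this pattern (an illustrative reconstruction of the old behavior described above, not the actual code):

```python
import torch

# The old code first wrapped a bare tensor in a list...
if isinstance(video, torch.Tensor):
    video = [video]
# ...and then combined the list along the batch dimension, so a user-supplied
# list of 5D tensors took the same path and happened to work as well.
video = torch.cat(video, dim=0) if video[0].ndim == 5 else torch.stack(video, dim=0)
```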

DN6 (Collaborator) left a comment:

LGTM 👍🏽

Comment on lines 892 to 893:

```python
if video and not isinstance(video[0], list):
    video = [video]
```
Collaborator:

Suggested change (remove these lines):

```python
if video and not isinstance(video[0], list):
    video = [video]
```

sayakpaul (Member Author):

But why do we want to do this? I don't think a single-frame video would be represented properly otherwise. We don't have any special treatment for that in preprocess_video either.

Collaborator:

Why would they need a video-to-video pipeline if it is a single frame, i.e., an image?

sayakpaul (Member Author):

I don’t know honestly. I tried to follow the original implementation (i.e., the implementation before the refactor) as faithfully as possible.

From the documentation though, it seems like it supports multi-frame videos too:
https://huggingface.co/docs/diffusers/en/api/pipelines/animatediff

So, my best guess is that it supports both — single-frame video and multi-frame video.

```python
# as a list of images
if video and not isinstance(video[0], list):
    video = [video]
if latents is None:
```
Collaborator:

Maybe raise an error in check_inputs when both latents and video are not None?

sayakpaul (Member Author):

Already there:

```python
if video is not None and latents is not None:
```

sayakpaul (Member Author):

@yiyixuxu could you look into the failing test? It's likely coming from the image processor. But if not, let me know.

yiyixuxu (Collaborator):

> could you look into the failing test? It's likely coming from the image processor. But if not, let me know.

@sayakpaul I can confirm that the test failure is due to the image processor; I will fix it!

sayakpaul (Member Author):

@yiyixuxu I think I have addressed your comments. LMK if this is good to merge.

yiyixuxu (Collaborator):

I left one comment: https://github.com/huggingface/diffusers/pull/7776/files#r1597075529. We should accept the same video input format across all our pipelines; I don't think AnimateDiff needs to be different. Other than that, it looks good to me!

sayakpaul (Member Author):

> we should accept same video input format across all our pipelines, I don't think that animatediff need to be different

Okay. I will delete the block that listifies a single image.

yiyixuxu (Collaborator):

Feel free to merge once the tests pass :)

@sayakpaul sayakpaul requested a review from yiyixuxu May 10, 2024 19:02
@sayakpaul sayakpaul merged commit 04f4bd5 into main May 10, 2024
17 checks passed
@sayakpaul sayakpaul deleted the video-processor branch May 10, 2024 19:02