Could you please provide more examples of running inference on the different tasks in the paper? #234

buaalyx · 2025-01-24T08:59:35Z

Such as temporal grounding on QVHighlights and Charades-STA

dongdk · 2025-02-05T08:07:26Z

+1

arushirai1 · 2025-02-26T19:53:59Z

Same question here. I have tried getting an output in frames or seconds and both seem to perform poorly.

Shuaicong97 · 2025-03-11T16:11:36Z

Any update? @shepnerd @buaalyx I tried using InternVideo2/demo, but since it uses the pretrained BERT, the feature dim is 512.

Shuaicong97 mentioned this issue Mar 14, 2025

Text + Video Features extraction using custom dataset Zhuo-Cao/FlashVTG#3

Open
