
Reproducibility question #4

Open
ytaek-oh opened this issue Feb 5, 2025 · 5 comments

Comments

@ytaek-oh commented Feb 5, 2025

Thanks for sharing the code and checkpoint for the insightful work!

I ran the evaluation code with the provided checkpoint and observed a slight gap between my results and those reported in the paper.
Specifically, I tested the retrieval tasks on COCO 2017 and Urban1k and obtained the following results:

{'coco': {0: {'image_to_text_R@1': 0.5848, 'image_to_text_R@5': 0.827, 'image_to_text_R@10': 0.8932,
               'text_to_image_R@1': 0.44212, 'text_to_image_R@5': 0.69456, 'text_to_image_R@10': 0.78664}}}

{'urban1k': {0: {'image_to_text_R@1': 0.882, 'image_to_text_R@5': 0.976, 'image_to_text_R@10': 0.987,
                 'text_to_image_R@1': 0.885, 'text_to_image_R@5': 0.978, 'text_to_image_R@10': 0.989}}}

Reported Results in the Paper (TULIP, ViT-L-14):

  • COCO:
    I2T R@1: 62.6, I2T R@5: 84.7
    T2I R@1: 46.1, T2I R@5: 71.1

  • Urban1k:
    I2T R@1: 90.1, T2I R@1: 91.1

I used the following command for evaluation:

python eval_tulip.py --model_name ViT-L-14 --pretrained ckpt.pt --distilled_model_path ckpt.pt \
    --pos_encodings rope --context_length 200

Am I missing any configuration or steps in the evaluation process?

Thanks,

@ivonajdenkoska (Owner)

Hi @ytaek-oh. Thanks for your interest!

That is odd. Can you try --context_length 248? We use 248 for the main results, for a fair comparison to the baselines (e.g. Long-CLIP).
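
For reference, this is the command from the original post with only the context length changed (everything else assumed identical):

python eval_tulip.py --model_name ViT-L-14 --pretrained ckpt.pt --distilled_model_path ckpt.pt \
    --pos_encodings rope --context_length 248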

@5RJ commented Feb 12, 2025

Have you reproduced the results in the paper? I've tried both context_length=248 and context_length=200, and the results are the same.
context_length=248
{'urban1k': {0: {'image_to_text_R@1': 0.881, 'image_to_text_R@5': 0.975, 'image_to_text_R@10': 0.986, 'text_to_image_R@1': 0.884, 'text_to_image_R@5': 0.977, 'text_to_image_R@10': 0.988}}}
context_length=200
{'urban1k': {0: {'image_to_text_R@1': 0.881, 'image_to_text_R@5': 0.975, 'image_to_text_R@10': 0.986, 'text_to_image_R@1': 0.884, 'text_to_image_R@5': 0.977, 'text_to_image_R@10': 0.988}}}
The command is as follows:
export CUDA_VISIBLE_DEVICES=1 && python eval_tulip.py --model_name ViT-L-14 --pretrained ckpt.pt --distilled_model_path ckpt.pt --pos_encodings rope --context_length 200 --run_urban1k

The "ckpt.pt" is download from https://huggingface.co/mderakhshani/TULIP/tree/main

@ytaek-oh (Author) commented Feb 12, 2025

@5RJ
Changing the context length did not affect the results for me.

Setting the context length to 248 while switching the activation layer to QuickGELU led to a slight improvement, though it did not fully bridge the gap.
(Set force_quick_gelu=True for open_clip.create_model_and_transforms, and set act_layer=QuickGELU when initializing TextTransformerRoPE; a sketch of these changes follows the results below.)

# Applying QuickGELU as the activation layer
coco 2017: ([('i2t/R@1', 60.66), ('i2t/R@5', 83.8), ('i2t/R@10', 90.08),
             ('t2i/R@1', 44.55),  ('t2i/R@5', 69.44), ('t2i/R@10', 78.42)])
urban 1k: ([('i2t/R@1', 88.7), ('i2t/R@5', 98.0),  ('i2t/R@10', 98.8), 
            ('t2i/R@1', 91.0), ('t2i/R@5', 98.1), ('t2i/R@10', 99.0)])
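
A minimal sketch of the two changes described above, assuming open_clip's force_quick_gelu flag and that the repo's TextTransformerRoPE constructor accepts an act_layer argument (the latter is taken from the description above, not verified against the code):

# Sketch: apply QuickGELU in both places mentioned above.
import open_clip
from open_clip.transformer import QuickGELU

# 1) Force QuickGELU (instead of nn.GELU) in the towers built by open_clip.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14",
    pretrained="ckpt.pt",   # local checkpoint path, as in the eval command
    force_quick_gelu=True,
)

# 2) Pass QuickGELU as the activation layer when building the RoPE text encoder.
#    TextTransformerRoPE is the class from the TULIP repo; its remaining
#    constructor arguments are omitted here.
# text_encoder = TextTransformerRoPE(..., act_layer=QuickGELU)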

@ivonajdenkoska (Owner)

Hi, can you please check the value of the flag --lit_style? It should be false.
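
For example, a quick check, assuming eval_tulip.py parses its flags with argparse into an args namespace (an assumption, not verified against the script):

# Hypothetical check right after argument parsing in eval_tulip.py.
print(vars(args))                     # dump every flag the run was launched with
print("lit_style:", args.lit_style)   # expected to print False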

@5RJ commented Feb 13, 2025

Yes, lit_style is False. I printed the args as follows:

[Screenshot of the printed evaluation arguments, showing lit_style=False]
