
Reproducibility question #4

Open
ytaek-oh opened this issue Feb 5, 2025 · 5 comments

Comments

@ytaek-oh commented Feb 5, 2025

Thanks for sharing the code and checkpoint for the insightful work!

I ran the evaluation code with the provided checkpoint and observed a slight gap between my results and those reported in the paper.
Specifically, I tested the retrieval tasks on COCO 2017 and Urban1k and obtained the following results:

{'coco': {0: {'image_to_text_R@1': 0.5848, 'image_to_text_R@5': 0.827, 'image_to_text_R@10': 0.8932,
               'text_to_image_R@1': 0.44212, 'text_to_image_R@5': 0.69456, 'text_to_image_R@10': 0.78664}}}

{'urban1k': {0: {'image_to_text_R@1': 0.882, 'image_to_text_R@5': 0.976, 'image_to_text_R@10': 0.987,
                 'text_to_image_R@1': 0.885, 'text_to_image_R@5': 0.978, 'text_to_image_R@10': 0.989}}}

Reported Results in the Paper (TULIP, ViT-L-14):

  • COCO:
    I2T R@1: 62.6, I2T R@5: 84.7
    T2I R@1: 46.1, T2I R@5: 71.1

  • Urban1k:
    I2T R@1: 90.1, T2I R@1: 91.1

I used the following command for evaluation:

python eval_tulip.py --model_name ViT-L-14 --pretrained ckpt.pt --distilled_model_path ckpt.pt \
    --pos_encodings rope --context_length 200

Am I missing any configuration or steps in the evaluation process?

Thanks,

@ivonajdenkoska (Owner)

Hi @ytaek-oh. Thanks for your interest!

That is odd. Can you try --context_length 248? We use 248 for the main results, for a fair comparison to the baselines (e.g. Long-CLIP).
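
For reference, this is the command from the original post with only the context length changed (everything else assumed identical):

python eval_tulip.py --model_name ViT-L-14 --pretrained ckpt.pt --distilled_model_path ckpt.pt \
    --pos_encodings rope --context_length 248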

@5RJ commented Feb 12, 2025

Have you reproduced the results in the paper? I've tried both context_length=248 and context_length=200, and the results are the same.
context_length=248
{'urban1k': {0: {'image_to_text_R@1': 0.881, 'image_to_text_R@5': 0.975, 'image_to_text_R@10': 0.986, 'text_to_image_R@1': 0.884, 'text_to_image_R@5': 0.977, 'text_to_image_R@10': 0.988}}}
context_length=200
{'urban1k': {0: {'image_to_text_R@1': 0.881, 'image_to_text_R@5': 0.975, 'image_to_text_R@10': 0.986, 'text_to_image_R@1': 0.884, 'text_to_image_R@5': 0.977, 'text_to_image_R@10': 0.988}}}
The command is as follows:
export CUDA_VISIBLE_DEVICES=1 && python eval_tulip.py --model_name ViT-L-14 --pretrained ckpt.pt --distilled_model_path ckpt.pt --pos_encodings rope --context_length 200 --run_urban1k

The "ckpt.pt" is download from https://huggingface.co/mderakhshani/TULIP/tree/main

@ytaek-oh (Author) commented Feb 12, 2025

@5RJ
Changing the context length did not affect the results for me.

Setting the context length to 248 while switching the activation layer to QuickGELU led to a slight improvement, though it did not fully bridge the gap.
(Set force_quick_gelu=True for open_clip.create_model_and_transforms, and set act_layer=QuickGELU when initializing TextTransformerRoPE; a sketch of these changes follows the results below.)

# Applying QuickGELU as the activation layer
coco 2017: ([('i2t/R@1', 60.66), ('i2t/R@5', 83.8), ('i2t/R@10', 90.08),
             ('t2i/R@1', 44.55),  ('t2i/R@5', 69.44), ('t2i/R@10', 78.42)])
urban 1k: ([('i2t/R@1', 88.7), ('i2t/R@5', 98.0),  ('i2t/R@10', 98.8), 
            ('t2i/R@1', 91.0), ('t2i/R@5', 98.1), ('t2i/R@10', 99.0)])
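
A minimal sketch of the two changes described above, assuming open_clip's force_quick_gelu flag and that the repo's TextTransformerRoPE constructor accepts an act_layer argument (the latter is taken from the description above, not verified against the code):

# Sketch: apply QuickGELU in both places mentioned above.
import open_clip
from open_clip.transformer import QuickGELU

# 1) Force QuickGELU (instead of nn.GELU) in the towers built by open_clip.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14",
    pretrained="ckpt.pt",   # local checkpoint path, as in the eval command
    force_quick_gelu=True,
)

# 2) Pass QuickGELU as the activation layer when building the RoPE text encoder.
#    TextTransformerRoPE is the class from the TULIP repo; its remaining
#    constructor arguments are omitted here.
# text_encoder = TextTransformerRoPE(..., act_layer=QuickGELU)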

@ivonajdenkoska (Owner)

Hi, can you please check the value of the flag --lit_style? It should be false.
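
For example, a quick check, assuming eval_tulip.py parses its flags with argparse into an args namespace (an assumption, not verified against the script):

# Hypothetical check right after argument parsing in eval_tulip.py.
print(vars(args))                     # dump every flag the run was launched with
print("lit_style:", args.lit_style)   # expected to print False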

@5RJ commented Feb 13, 2025

Yes, lit_style is False. I printed the args as follows:

[Screenshot of the printed evaluation arguments, showing lit_style=False]
