
clip-like

Train (fine-tune) OpenAI's CLIP-like models on custom image-caption datasets, e.g. the COCO dataset. PyTorch implementation.

What's CLIP?

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given image, without directly optimizing for that task, similarly to the zero-shot capabilities of GPT-2 and GPT-3.

[Blog] [Paper] [Github] [Colab]
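To make the zero-shot behaviour concrete, here is a minimal sketch (not part of this repository) using the CLIP wrappers from the Hugging Face transformers library; the checkpoint name, candidate captions, and the example COCO image URL are illustrative assumptions.

```python
# Sketch: score one image against free-form candidate captions with a public CLIP checkpoint.
from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example COCO validation image (illustrative URL).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)
# Image-to-text similarity logits, turned into probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)
```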

Usage

The notebook "fine-tune-clip.ipynb" can be used to assemble and fine-tune a CLIP-like model; it illustrates the process on the COCO dataset. It leverages the VisionTextDualEncoder toolkit from the Hugging Face transformers library. The dual encoder wraps pre-trained text and image encoders, which are fine-tuned (together with newly initialized projection layers) to produce aligned representations of text-image pairs in a single latent space, by minimizing a contrastive loss computed over those representations. The objective is to maximize the similarity of valid pairs and minimize that of invalid ones.
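A minimal sketch of this setup, assuming the VisionTextDualEncoder classes from Hugging Face transformers; the backbone checkpoints, learning rate, and training step below are illustrative assumptions rather than the notebook's exact code.

```python
import torch
from transformers import (
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
    AutoImageProcessor,
    AutoTokenizer,
)

# Wrap a pre-trained vision encoder and a pre-trained text encoder; the projection
# layers mapping both into the shared latent space are newly initialized.
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "google/vit-base-patch16-224", "bert-base-uncased"
)
processor = VisionTextDualEncoderProcessor(
    AutoImageProcessor.from_pretrained("google/vit-base-patch16-224"),
    AutoTokenizer.from_pretrained("bert-base-uncased"),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def training_step(images, captions):
    """One contrastive step on a batch of (PIL image, caption string) pairs."""
    inputs = processor(
        text=captions, images=images, return_tensors="pt", padding=True
    )
    # return_loss=True makes the model compute the symmetric contrastive
    # (image-to-text and text-to-image) cross-entropy loss internally.
    outputs = model(**inputs, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```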

The notebook should run in a free Colab session without any problem.
