
clip-like

Train (fine-tune) OpenAI's CLIP-like models on custom image-caption datasets, e.g. the COCO dataset. PyTorch implementation.

What's CLIP?

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given image, without directly optimizing for that task, similarly to the zero-shot capabilities of GPT-2 and GPT-3.

[Blog] [Paper] [Github] [Colab]
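To make the zero-shot behaviour concrete, here is a minimal sketch (not part of this repository) using the CLIP wrappers from the Hugging Face transformers library; the checkpoint name, candidate captions, and the example COCO image URL are illustrative assumptions.

```python
# Sketch: score one image against free-form candidate captions with a public CLIP checkpoint.
from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example COCO validation image (illustrative URL).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)
# Image-to-text similarity logits, turned into probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)
```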

Usage

The notebook "fine-tune-clip.ipynb" can be used to assemble and fine-tune a CLIP-like model; it illustrates the process on the COCO dataset. It leverages the VisionTextDualEncoder toolkit from the Hugging Face transformers library. The dual encoder wraps pre-trained text and image encoders, which are fine-tuned (together with newly initialized projection layers) to produce aligned representations of text-image pairs in a single latent space, by minimizing a contrastive loss computed over those representations. The objective is to maximize the similarity of valid pairs and minimize that of invalid ones.
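A minimal sketch of this setup, assuming the VisionTextDualEncoder classes from Hugging Face transformers; the backbone checkpoints, learning rate, and training step below are illustrative assumptions rather than the notebook's exact code.

```python
import torch
from transformers import (
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
    AutoImageProcessor,
    AutoTokenizer,
)

# Wrap a pre-trained vision encoder and a pre-trained text encoder; the projection
# layers mapping both into the shared latent space are newly initialized.
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "google/vit-base-patch16-224", "bert-base-uncased"
)
processor = VisionTextDualEncoderProcessor(
    AutoImageProcessor.from_pretrained("google/vit-base-patch16-224"),
    AutoTokenizer.from_pretrained("bert-base-uncased"),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def training_step(images, captions):
    """One contrastive step on a batch of (PIL image, caption string) pairs."""
    inputs = processor(
        text=captions, images=images, return_tensors="pt", padding=True
    )
    # return_loss=True makes the model compute the symmetric contrastive
    # (image-to-text and text-to-image) cross-entropy loss internally.
    outputs = model(**inputs, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```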

The notebook should run in a free Colab session without any problem.
