This project builds upon the LAVIS library's BLIP2 model.
The main idea is to replace the tokenizer and the underlying BERT model in BLIP2's Qformer with ones trained on Japanese datasets, and then retrain the updated model on Japanese captioning datasets.
The model has been trained on the COCO dataset with STAIR captions.
The weights of Blip2_Japanese_qformer trained on STAIR can be obtained from Hugging Face.
Copy the whole folder under the lavis directory and make sure the directory is called pretrained.
Moreover, download the bert-base-japanese-whole-word-masking weights and config from the Hugging Face link.
You should now be able to run the example.ipynb notebook.
For directory naming conventions, you can also refer to the .gitignore file.
Captions generated for the flickr30k dataset can be found in flickr30k_caption.json. The script that generates them is in flickr30k_caption_generate.ipynb.
These captions are generated using top-k sampling instead of nucleus sampling.
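To make the difference between the two sampling strategies concrete, here is a minimal sketch in plain Python. It does not use this repository's generation code. The logits, the value of k, and the value of p are made-up values for illustration, and all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_candidates(probs, k):
    """Keep the k most probable token ids and renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in ranked)
    return {i: probs[i] / mass for i in ranked}


def nucleus_candidates(probs, p):
    """Keep the smallest high-probability set whose mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}


def sample(candidates, rng):
    """Draw one token id from a renormalized candidate distribution."""
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]


# Toy next-token logits over a 6-token vocabulary (made-up values).
probs = softmax([3.0, 2.5, 1.0, 0.5, 0.1, -1.0])
rng = random.Random(0)
top_k = top_k_candidates(probs, k=3)
nucleus = nucleus_candidates(probs, p=0.8)
print("top-k ids:", sorted(top_k), "sampled:", sample(top_k, rng))
print("nucleus ids:", sorted(nucleus), "sampled:", sample(nucleus, rng))
```

Top-k always keeps a fixed number of candidates, while nucleus sampling keeps a variable number that depends on how peaked the distribution is. With the toy values above, top-k keeps three tokens and the nucleus keeps only two.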
Captions generated by the pretrained and finetuned models are shown below:
pretrained: {'image': '1001773457.jpg', 'caption': ['二 匹 の 犬 が 道路 で フリスビー を し て いる']} # No frisbee
finetuned: {'image': '1001773457.jpg', 'caption': ['二 匹 の 犬 が 道路 で 喧嘩 を し て いる']}
pretrained: {'image': '1001573224.jpg', 'caption': ['6 人 の 女性 が 屋内 で 飛び跳ね て いる']} # Wrong head count
finetuned: {'image': '1001573224.jpg', 'caption': ['黒い 服 を 着 た 女性 たち が 飛び跳ね て いる']}
In general, captions generated by the finetuned model are more accurate.
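The captions above are printed as space-separated tokens. The following sketch shows one way to turn such records into plain Japanese text and look them up by image name. It assumes that flickr30k_caption.json holds a JSON list of records shaped like the finetuned examples above; the file layout is not confirmed here, and the helper names are hypothetical.

```python
import json

# Two records copied from the comparison above, stored as a JSON list.
# Assumption: flickr30k_caption.json holds a list of records in this shape.
raw = '''[
  {"image": "1001773457.jpg", "caption": ["二 匹 の 犬 が 道路 で 喧嘩 を し て いる"]},
  {"image": "1001573224.jpg", "caption": ["黒い 服 を 着 た 女性 たち が 飛び跳ね て いる"]}
]'''


def detokenize(caption):
    """Remove the spaces the tokenizer leaves between Japanese tokens."""
    return "".join(caption.split())


def captions_by_image(records):
    """Map each image file name to its list of detokenized captions."""
    return {r["image"]: [detokenize(c) for c in r["caption"]] for r in records}


lookup = captions_by_image(json.loads(raw))
for image, captions in lookup.items():
    print(image, captions)
```

Note that dropping every space is only safe for captions that are entirely Japanese; a caption containing Latin-script words would need a more careful rule.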
Refer to the example.ipynb notebook for more details. The idea is to compute, for each query token, the cosine similarity between the image embeddings and the multimodal embeddings, and then average it over the query tokens.
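The averaged similarity can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's code: it assumes each embedding set is a list with one vector per query token, that both sets share the same dimensionality, and that token i of one set is compared with token i of the other. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_query_similarity(image_embeds, multimodal_embeds):
    """Average the per-query-token cosine similarity of two embedding sets.

    Both arguments are lists of vectors, one vector per query token, and
    token i in one list is compared with token i in the other.
    """
    if len(image_embeds) != len(multimodal_embeds):
        raise ValueError("both inputs need the same number of query tokens")
    sims = [cosine(u, v) for u, v in zip(image_embeds, multimodal_embeds)]
    return sum(sims) / len(sims)


# Toy data: 3 query tokens with 4-dimensional embeddings (made-up values).
image_embeds = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
matching = [[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [3.0, 3.0, 0.0, 0.0]]
unrelated = [[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

print(mean_query_similarity(image_embeds, matching))
print(mean_query_similarity(image_embeds, unrelated))
```

A score near 1 means the two embedding sets point in nearly the same directions token by token, while a score near 0 means they are close to orthogonal. Cosine similarity ignores vector length, which is why the rescaled vectors in `matching` still score as aligned.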
The model was trained on a single RTX 4080 laptop GPU, so the training config was modified as follows:
In blip2_pretrain.yaml: vit_precision = 'fp16'
In pretrain_stage1.yaml: batch_size = 25
During evaluation, you have to change vit_precision back to fp32.
The pretrained and finetuned weights may be updated without prior notice. If you cannot reproduce the results in the example notebook, please re-download the weights and try again.
A simple interface for demo purposes can be found in generator-ui.py. To run the UI:
python generator-ui.py