The images are under dataset_construction/dataset_images.
The code for generating the images is in dataset_construction/CreateImage.py.
Note: You might need to install a variety of fonts on your system in order to run the code.
Run generate_qa.ipynb to generate the full dataset and save it locally.
Our code for running experiments relies on prl_ml (https://github.com/nuprl/prl_ml), which supports resuming interrupted runs.
Keep --batch-size at 1 for all experiments, as batching is not possible for these models.
To run the experiments, first clone the prl_ml repository. Then copy all the files in the run_model directory into the prl_ml/batched_lm_generation directory. Finally, cd into the prl_ml repository and run the generations with the commands below.
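The copy step can also be scripted. The following is a minimal sketch, not part of the repository: the function name `copy_run_model_files` is hypothetical, and both directory paths are passed in by the caller so that no assumption is made about where the two repositories live on disk.

```python
import shutil
from pathlib import Path
from typing import List


def copy_run_model_files(run_model_dir: str, target_dir: str) -> List[str]:
    """Copy every regular file in run_model_dir into target_dir.

    target_dir is meant to be the batched_lm_generation directory of a
    local prl_ml clone (an assumption about how this helper is used).
    Returns the sorted names of the files that were copied.
    """
    source = Path(run_model_dir)
    target = Path(target_dir)
    if not target.is_dir():
        raise FileNotFoundError(f"target directory does not exist: {target}")

    copied = []
    for path in sorted(source.iterdir()):
        # Skip subdirectories; only top-level files are copied.
        if path.is_file():
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied
```

Checking that the target directory exists before copying guards against a mistyped path silently creating files in the wrong place.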
For each model, run the code file named after the model. The --fewshot-prompt argument accepts '1shot', '3shot', or '5shot'; omit it for zero-shot.
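Because the commands in this README differ only in a few arguments, they can be assembled programmatically. The sketch below is illustrative only: `build_command` and `FEWSHOT_CHOICES` are hypothetical names, and the flags are simply the ones that appear in the commands in this README. It omits --fewshot-prompt for zero-shot runs and rejects any value other than the three listed above.

```python
import shlex
from typing import List, Optional

# Few-shot settings listed in this README; None stands for zero-shot.
FEWSHOT_CHOICES = ("1shot", "3shot", "5shot")


def build_command(
    module: str,
    dataset: str,
    output_dir: str,
    model_name: str,
    fewshot: Optional[str] = None,
    max_tokens: Optional[int] = None,
) -> List[str]:
    """Return the argument list for one generation run."""
    if fewshot is not None and fewshot not in FEWSHOT_CHOICES:
        raise ValueError(f"fewshot must be one of {FEWSHOT_CHOICES} or None")

    cmd = [
        "python3", "-m", f"prl_ml.batched_lm_generation.{module}",
        "--dataset", dataset,
        "--output-dir", output_dir,
        "--model-name", model_name,
        "--temperature", "0",
        "--batch-size", "1",  # batching is not possible for these models
        "--completion-limit", "1",
    ]
    if max_tokens is not None:
        cmd += ["--max-tokens", str(max_tokens)]
    cmd += ["--prompt-keys", "prompt,images"]
    if fewshot is not None:
        # Zero-shot runs leave this flag out entirely.
        cmd += ["--fewshot-prompt", fewshot]
    cmd += ["--extra-columns", "file_name,answer"]
    return cmd


if __name__ == "__main__":
    example = build_command(
        "gpt4o_vision",
        "disk:path_to_GlyphPattern_repo/dataset_construction/GlyphPattern/four_choice",
        "gpt4o-4choice-3shot",
        "gpt-4o",
        fewshot="3shot",
        max_tokens=10,
    )
    # Print a shell-quoted version that can be pasted into a terminal.
    print(shlex.join(example))
```

Returning a list rather than a string keeps the result usable with `subprocess.run` without shell quoting concerns.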
GPT-4o
e.g. Run GPT-4o few-shot with three examples using gpt4o_vision.py:
python3 -m prl_ml.batched_lm_generation.gpt4o_vision \
--dataset "disk:path_to_GlyphPattern_repo/dataset_construction/GlyphPattern/four_choice" \
--output-dir gpt4o-4choice-3shot \
--model-name gpt-4o \
--temperature 0 \
--batch-size 1 \
--completion-limit 1 \
--max-tokens 10 \
--prompt-keys prompt,images \
--fewshot-prompt 3shot \
--extra-columns file_name,answer
e.g. Chain-of-thought (CoT):
python3 -m prl_ml.batched_lm_generation.gpt4o_vision_cot \
--dataset "disk:path_to_GlyphPattern_repo/dataset_construction/GlyphPattern/four_choice" \
--output-dir gpt4o-4choice-fewshot-cot \
--model-name gpt-4o-2024-05-13 \
--temperature 0 \
--batch-size 1 \
--completion-limit 1 \
--max-tokens 1024 \
--prompt-keys prompt,images \
--fewshot-prompt 3shot \
--extra-columns file_name,answer
Gemini 1.5
e.g. Run Gemini 1.5 Pro few-shot with five examples using geminipro_vision.py:
python3 -m prl_ml.batched_lm_generation.geminipro_vision \
--dataset "disk:path_to_GlyphPattern_repo/dataset_construction/GlyphPattern/four_choice" \
--output-dir geminipro-4choice-5shot \
--model-name gemini-1.5-pro \
--temperature 0 \
--batch-size 1 \
--completion-limit 1 \
--max-tokens 10 \
--prompt-keys prompt,images \
--fewshot-prompt 5shot \
--extra-columns file_name,answer
For Idefics models and other chat models that support the AutoModelForVision2Seq class, run the idefics_vision code file with --output-dir and --model-name replaced.
Idefics2
e.g. Run Idefics2 from HuggingFaceM4/idefics2-8b few-shot with one example:
python3 -m prl_ml.batched_lm_generation.idefics_vision \
--dataset "disk:path_to_GlyphPattern_repo/dataset_construction/GlyphPattern/four_choice" \
--output-dir Idefics2-4choice-1shot \
--model-name HuggingFaceM4/idefics2-8b \
--temperature 0 \
--batch-size 1 \
--completion-limit 1 \
--prompt-keys prompt,images \
--fewshot-prompt 1shot \
--extra-columns file_name,answer
Idefics3
e.g. Run Idefics3 from HuggingFaceM4/Idefics3-8B-Llama3 zero-shot:
python3 -m prl_ml.batched_lm_generation.idefics_vision \
--dataset "disk:path_to_GlyphPattern_repo/dataset_construction/GlyphPattern/four_choice" \
--output-dir Idefics3-4choice-zeroshot \
--model-name HuggingFaceM4/Idefics3-8B-Llama3 \
--temperature 0 \
--batch-size 1 \
--completion-limit 1 \
--prompt-keys prompt,images \
--extra-columns file_name,answer
LlavaNext
e.g. Run LlavaNext from llava-hf/llava-v1.6-mistral-7b-hf zero-shot:
python3 -m prl_ml.batched_lm_generation.llavanext_vision \
--dataset "disk:path_to_GlyphPattern_repo/dataset_construction/GlyphPattern/four_choice" \
--output-dir llavanext-4choice-zeroshot \
--model-name llava-hf/llava-v1.6-mistral-7b-hf \
--temperature 0 \
--batch-size 1 \
--completion-limit 1 \
--prompt-keys prompt,images \
--extra-columns file_name,answer
InstructBlip
Run InstructBlip from Salesforce/instructblip-vicuna-7b. Only zero-shot is supported.
python3 -m prl_ml.batched_lm_generation.instructblip_vision \
--dataset "disk:path_to_GlyphPattern_repo/dataset_construction/GlyphPattern/four_choice" \
--output-dir instructblip-4choice-zeroshot \
--model-name Salesforce/instructblip-vicuna-7b \
--temperature 0 \
--batch-size 1 \
--completion-limit 1 \
--prompt-keys prompt,images \
--extra-columns file_name,answer
We support few-shot free response for the two best-performing models, GPT-4o and Gemini 1.5. Run the code files with the suffix freeresponse, and use the dataset split gt_descriptions.
python3 -m prl_ml.batched_lm_generation.gpt4o_freeresponse \
--dataset "disk:path_to_GlyphPattern_repo/dataset_construction/GlyphPattern/gt_descriptions" \
--output-dir gpt4o-free \
--model-name gpt-4o \
--temperature 0 \
--batch-size 1 \
--completion-limit 1 \
--max-tokens 500 \
--prompt-keys prompt,images \
--extra-columns file_name,answer

python3 -m prl_ml.batched_lm_generation.geminipro_freeresponse \
--dataset "disk:path_to_GlyphPattern_repo/dataset_construction/GlyphPattern/gt_descriptions" \
--output-dir geminipro-free-leftright \
--model-name gemini-1.5-pro \
--temperature 0 \
--batch-size 1 \
--completion-limit 1 \
--max-tokens 500 \
--prompt-keys prompt,images \
--extra-columns file_name,answer
The chain-of-thought (CoT) commands for GPT-4o and Gemini 1.5 Pro are:
python3 -m prl_ml.batched_lm_generation.gpt4o_vision_cot \
--dataset "disk:path_to_GlyphPattern_repo/dataset_construction/GlyphPattern/four_choice" \
--output-dir gpt4o-4choice-fewshot-cot \
--model-name gpt-4o-2024-05-13 \
--temperature 0 \
--batch-size 1 \
--completion-limit 1 \
--max-tokens 1024 \
--prompt-keys prompt,images \
--fewshot-prompt 3shot \
--extra-columns file_name,answer
python3 -m prl_ml.batched_lm_generation.geminipro_vision_cot \
--dataset "disk:path_to_GlyphPattern_repo/dataset_construction/GlyphPattern/four_choice" \
--output-dir geminipro-4choice-fewshot-cot \
--model-name gemini-1.5-pro-001 \
--temperature 0 \
--batch-size 1 \
--completion-limit 1 \
--max-tokens 1024 \
--prompt-keys prompt,images \
--fewshot-prompt 3shot \
--extra-columns file_name,answer
The model result CSV files are in the model_results directory.
To process the results generated by the experiments above, run check_answers.py in the process_results directory. Replace result_dir with the model output directory created by running batched_lm_generation.
For example, to check the outputs of the LlavaNext zero-shot multiple-choice run:
python process_results/check_answers.py \
--result_dir llavanext-4choice-zeroshot \
--model_name llavanext
This produces a CSV file at the location of the result directory.
To calculate accuracy based on visual output, run outputAccuracy.py with the file_path replaced.
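For readers who want a quick sanity check on a results CSV, the general idea of an accuracy computation can be sketched as follows. This is not the logic of outputAccuracy.py. The `answer` column comes from the --extra-columns flag used in the commands above, while `model_answer` is a hypothetical column name that should be replaced with whatever the processed CSV actually contains.

```python
import csv
import io


def accuracy_from_csv(csv_file, gold_column="answer", predicted_column="model_answer"):
    """Return the fraction of rows where the prediction matches the gold answer.

    csv_file is any open text file (or file-like object) with a header row.
    The default predicted_column name is a placeholder, not a documented one.
    """
    total = 0
    correct = 0
    for row in csv.DictReader(csv_file):
        total += 1
        # Compare case-insensitively and ignore surrounding whitespace.
        if row[gold_column].strip().upper() == row[predicted_column].strip().upper():
            correct += 1
    if total == 0:
        raise ValueError("the CSV contains no rows to score")
    return correct / total


if __name__ == "__main__":
    # Tiny in-memory example with made-up rows, for illustration only.
    sample = io.StringIO(
        "file_name,answer,model_answer\n"
        "a.png,A,A\n"
        "b.png,B,C\n"
        "c.png,D,d\n"
        "d.png,C,C\n"
    )
    print(accuracy_from_csv(sample))  # 0.75
```

To score a real file, open it with `open(path, newline="")` and pass the handle to `accuracy_from_csv`.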