Checking in code

amazon-science · Nov 10, 2024 · 843d2b8
1 parent 3bf0c69
commit 843d2b8
Showing 273 changed files with 69,134 additions and 2 deletions.
diff --git a/AP_sampling/.README.md.swn b/AP_sampling/.README.md.swn
diff --git a/AP_sampling/README.md b/AP_sampling/README.md
@@ -0,0 +1,51 @@
+# Explaining and Improving Contrastive Decoding by Extrapolating the Probabilities of a Huge and Hypothetical LM
+
+<p align="center"><img src="https://github.com/amazon-science/llm-asymptotic-decoding/blob/master/AP_sampling/imgs/APD_first_figure.png?raw=true" width="1586" height="1402"></p>
+
+## Introduction
+
+To overcome the limitations of contrastive decoding (CD), we propose a new unsupervised decoding method called **A**symptotic **P**robability **D**ecoding (APD). APD explicitly extrapolates the probability curves from LMs of different sizes to infer the asymptotic probabilities of an infinitely large LM, without incurring more inference cost than CD. In FactualityPrompts, an open-ended text generation benchmark, sampling with APD significantly boosts factuality compared to CD sampling and its variants, and achieves state-of-the-art results for Pythia 6.9B and OPT 6.7B. Furthermore, on five commonsense QA datasets, APD is often significantly better than CD and achieves an effect similar to using a larger LLM. For example, the perplexity of APD on top of Pythia 6.9B is even lower than the perplexity of Pythia 12B on CommonsenseQA and LAMBADA.
+
+
+## Computational Environment
+
+You can reproduce our Python environment using
+```
+conda create --name <env> --file requirement.txt
+```
+Most of the code can also run with older versions of huggingface (e.g., the version in REAL_sampling/requirement.txt), except for the code that runs the Qwen LLM.
+
+## How to run APD
+
+To learn how to use APD and/or REAL sampling in huggingface, please see the following example code:
+
+```
+./src/example_APD_REAL.py
+```
+
+### Run FactualityPrompts
+
+To evaluate the generation results, first follow ../FactualityPrompt/README.md to download the data, update ../FactualityPrompt/src/const.py, and then run the script below.
+
+If your machine has more than 7 GPUs, you can run the following script to generate the continuations.
+```
+./bin/continue_wiki_prompt_loop_eval.sh
+```
+
+### Run Question Answering Datasets
+
+Step 1: Run the dataset download scripts in src/QA/dataset_preparation (for ARC, we concatenate the easy and challenge JSON outputs).
+
+Step 2: Test APD models on the datasets. For datasets with only positive answers (e.g., LAMBADA, SQuAD, and MultiRC), use src/QA/dataset_preparation/test_squad_dataset.py. For datasets with negative answers (e.g., QASC, ARC, SocialIQA, and CommonsenseQA), use src/QA/dataset_preparation/test_neg_dataset.py. If you also want to test the APD on-the-fly baseline, use test_squad_dataset_online_all.py and test_neg_dataset_online_all.py instead. Remember to change the paths in each file accordingly.
+
+Step 3: Run analyze_results.py or analyze_results_online_all.py to collect the results. For datasets that have negative answers and accuracy metrics, set have_acc to 1.
+
+
+## How to Train ALM' (in order to use APD)
+
+Put your text file into "data/raw/".
+
+Change INPUT_FILE, data_folder_name, and OUTPUT_MODEL_FOLDER in bin/finetune_ALM.sh and run it (this assumes your machine has more than 7 GPUs).
+
+Note that our current implementation first saves many probabilities and logits for the top tokens of various LLMs into a cache, which takes a lot of disk space.
+We also need a lot of CPU memory to load these probabilities. For example, after processing ~270M of Wikipedia text with 5 OPT models, we store 70G of tensors and 52G of dataset cache, and our server has around 750G of CPU memory.
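Note on the README's disk-space warning: since the cache can reach the sizes quoted there (70G of tensors plus 52G of dataset cache in the authors' OPT example), it may be worth checking free space before starting `bin/finetune_ALM.sh`. The following is only a minimal sketch and not part of the commit. The 122G threshold is simply the sum of the two README figures, and `cache_dir` is a hypothetical variable for wherever the cache will be written; both are assumptions to adjust for your own data.

```bash
#!/bin/bash
# Assumption: threshold taken from the README example (70G tensors + 52G dataset cache).
required_gb=122
# Assumption: directory that will hold the cache; defaults to the current directory.
cache_dir="${1:-.}"

# POSIX `df -Pk` reports sizes in 1024-byte blocks; column 4 is the available space.
avail_kb=$(df -Pk "$cache_dir" | awk 'NR==2 {print $4}')
avail_gb=$((avail_kb / 1024 / 1024))

if [ "$avail_gb" -lt "$required_gb" ]; then
    echo "Only ${avail_gb}G free in $cache_dir; the README example used about ${required_gb}G." >&2
else
    echo "${avail_gb}G free in $cache_dir."
fi
```

The sketch only warns rather than aborting, because the real requirement depends on corpus size, the number of models, and `top_k`.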
diff --git a/AP_sampling/bin/.finetune_ALM.sh.swp b/AP_sampling/bin/.finetune_ALM.sh.swp
diff --git a/AP_sampling/bin/collect_top_prob.sh b/AP_sampling/bin/collect_top_prob.sh
@@ -0,0 +1,51 @@
+#!/bin/bash
+#top_k=10
+bptt=1024
+
+#data_folder_name="wiki2021_1e4_Pythia"
+#data_folder_name="ROC_gen_1000_p095_Pythia"
+#data_folder_name="news_gen_1000_p095_Pythia"
+#data_folder_name="wp_gen_1000_p095_Pythia"
+#data_folder_name="wiki2021_1e6_Pythia"
+data_folder_name="wiki2021_5e6_Pythia"
+#data_folder_name="ROC_spring_Pythia"
+#data_folder_name="wikinews_Pythia"
+#data_folder_name="wp_5000_Pythia"
+#data_folder_name="wp_20000_Pythia"
+#data_folder_name="wiki2021_1e5_Pythia"
+
+#top_k="10"
+#sampling_methods="10_20"
+
+top_k="20,5,10"
+sampling_methods="0_20,20_100,100_inf"
+#top_k="20,20,20"
+#sampling_methods="0_20,20_100,100_inf"
+
+top_w_idx_model_name="EleutherAI/pythia-6.9b-deduped"
+output_folder="data/processed/$data_folder_name/prob_tensor_${bptt}_ext2"
+#output_folder="data/processed/$data_folder_name/prob_tensor_${bptt}_ext3"
+#input_folder_name="../true_entropy/data/processed/$data_folder_name"
+input_folder_name="data/processed/$data_folder_name"
+
+declare -a bsz_arr=(2 4 4 8 12 16)
+declare -a model_arr=("EleutherAI/pythia-2.8b-deduped" "EleutherAI/pythia-1.4b-deduped" "EleutherAI/pythia-1b-deduped" "EleutherAI/pythia-410m-deduped" "EleutherAI/pythia-160m-deduped" "EleutherAI/pythia-70m-deduped" )
+
+model_name="EleutherAI/pythia-6.9b-deduped"
+batch_size=1
+cuda_init=0
+echo "python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $cuda_init --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt"
+python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $cuda_init --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt
+
+pids=()
+
+for i in "${!model_arr[@]}";
+do
+	model_name=${model_arr[$i]}
+	batch_size=${bsz_arr[$i]}
+	echo "python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $i --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt"
+	python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $i --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt &
+	pids+=($!)
+done
+echo "${pids[@]}"
+
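Note on the loop above: it starts one background job per model and records each PID in `pids`, but the script only prints the PIDs and returns. If a caller needs to block until every job finishes and know whether any failed, bash's `wait` builtin is one option. The sketch below is not part of the commit; `run_job` is a hypothetical stand-in for the `python src/collect_top_prob.py ...` call, and the exit codes are made up for illustration.

```bash
#!/bin/bash
# Hypothetical stand-in for the python command; exits with the code it is given.
run_job() { return "$1"; }

pids=()
for code in 0 0 3; do
    run_job "$code" &        # launch in the background, as the script does
    pids+=($!)               # remember the PID of the job just started
done

# Wait for each job in turn; `wait PID` returns that job's exit status.
failed=0
for pid in "${pids[@]}"; do
    if ! wait "$pid"; then
        failed=$((failed + 1))
    fi
done
echo "failed jobs: $failed"
```

Waiting on each PID separately keeps per-job exit statuses, whereas a bare `wait` with no arguments blocks until all jobs finish but does not report which one failed.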
diff --git a/AP_sampling/bin/collect_top_prob_Qwen_4b.sh b/AP_sampling/bin/collect_top_prob_Qwen_4b.sh
@@ -0,0 +1,55 @@
+#!/bin/bash
+#bptt=1024
+bptt=128
+
+#data_folder_name="ROC_gen_1000_p095_OPT"
+#data_folder_name="news_gen_1000_p095_OPT"
+#data_folder_name="wp_gen_1000_p095_OPT"
+#data_folder_name="openwebtext_2017_18_1e5_OPT"
+#data_folder_name="wiki2021_1e6_OPT"
+data_folder_name="wiki2021_1e6_Qwen"
+#data_folder_name="wiki2021_5e6_OPT"
+#data_folder_name="ROC_spring_OPT"
+#data_folder_name="wikinews_OPT"
+#data_folder_name="wp_5000_OPT"
+#data_folder_name="wp_20000_OPT"
+#data_folder_name="wiki2021_1e5_OPT"
+
+#top_k="10"
+#sampling_methods="10_20"
+top_k="20,5,10"
+sampling_methods="0_20,20_100,100_inf"
+
+#top_w_idx_model_name="EleutherAI/pythia-6.9b-deduped"
+#top_w_idx_model_name="facebook/opt-6.7b"
+top_w_idx_model_name="Qwen/Qwen1.5-4b"
+#top_w_idx_model_name="Qwen/Qwen1.5-4b-Chat"
+#output_folder="data/processed/$data_folder_name/prob_opt_tensor_$bptt"
+output_folder="data/processed/$data_folder_name/prob_Qwen_4b_tensor_${bptt}_new"
+#output_folder="data/processed/$data_folder_name/prob_Qwen_4b-Chat_tensor_${bptt}_new"
+#input_folder_name="../true_entropy/data/processed/$data_folder_name"
+input_folder_name="data/processed/$data_folder_name"
+
+declare -a bsz_arr=(4 8)
+declare -a model_arr=("Qwen/Qwen1.5-1.8b" "Qwen/Qwen1.5-0.5b" )
+#declare -a model_arr=("Qwen/Qwen1.5-1.8b-Chat" "Qwen/Qwen1.5-0.5b-Chat" )
+
+model_name="Qwen/Qwen1.5-4b"
+#model_name="Qwen/Qwen1.5-4b-Chat"
+batch_size=2
+cuda_init=0
+echo "python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $cuda_init --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt"
+python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $cuda_init --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt
+
+pids=()
+
+for i in "${!model_arr[@]}";
+do
+	model_name=${model_arr[$i]}
+	batch_size=${bsz_arr[$i]}
+	echo "python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $i --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt"
+	python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $i --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt &
+	pids+=($!)
+done
+echo "${pids[@]}"
+
diff --git a/AP_sampling/bin/collect_top_prob_opt.sh b/AP_sampling/bin/collect_top_prob_opt.sh
@@ -0,0 +1,48 @@
+#!/bin/bash
+bptt=1024
+
+#data_folder_name="ROC_gen_1000_p095_OPT"
+#data_folder_name="news_gen_1000_p095_OPT"
+#data_folder_name="wp_gen_1000_p095_OPT"
+#data_folder_name="openwebtext_2017_18_1e5_OPT"
+#data_folder_name="wiki2021_1e6_OPT"
+data_folder_name="wiki2021_5e6_OPT"
+#data_folder_name="ROC_spring_OPT"
+#data_folder_name="wikinews_OPT"
+#data_folder_name="wp_5000_OPT"
+#data_folder_name="wp_20000_OPT"
+#data_folder_name="wiki2021_1e5_OPT"
+
+#top_k="10"
+#sampling_methods="10_20"
+top_k="20,5,10"
+sampling_methods="0_20,20_100,100_inf"
+
+#top_w_idx_model_name="EleutherAI/pythia-6.9b-deduped"
+top_w_idx_model_name="facebook/opt-6.7b"
+#output_folder="data/processed/$data_folder_name/prob_opt_tensor_$bptt"
+output_folder="data/processed/$data_folder_name/prob_opt_tensor_${bptt}_new"
+#input_folder_name="../true_entropy/data/processed/$data_folder_name"
+input_folder_name="data/processed/$data_folder_name"
+
+declare -a bsz_arr=(2 4 8 16)
+declare -a model_arr=("facebook/opt-2.7b" "facebook/opt-1.3b" "facebook/opt-350m" "facebook/opt-125m" )
+
+model_name="facebook/opt-6.7b"
+batch_size=1
+cuda_init=0
+echo "python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $cuda_init --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt"
+python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $cuda_init --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt
+
+pids=()
+
+for i in "${!model_arr[@]}";
+do
+	model_name=${model_arr[$i]}
+	batch_size=${bsz_arr[$i]}
+	echo "python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $i --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt"
+	python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $i --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt &
+	pids+=($!)
+done
+echo "${pids[@]}"
+
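Note on all three scripts: each one writes the `python src/collect_top_prob.py ...` command twice, once inside `echo` and once for execution, so the two copies can drift apart when a flag is edited. One possible alternative is to build the command once as a bash array. The sketch below is not part of the commit; it shows only a subset of the flags, and `runner` is a hypothetical variable set to `echo` so the example runs without the repository (in real use it would be `python`).

```bash
#!/bin/bash
# Values copied from collect_top_prob_opt.sh; only a subset of the flags is shown.
model_name="facebook/opt-6.7b"
batch_size=1
bptt=1024

# Assumption: `echo` stands in for `python` so the sketch is runnable anywhere.
runner="echo"

# Build the command once; each array element is exactly one argument.
cmd=("$runner" src/collect_top_prob.py --model_name="$model_name" --batch_size "$batch_size" --bptt "$bptt")

# Log the command, then execute the very same array.
logged="${cmd[*]}"
echo "$logged"
result="$("${cmd[@]}")"
echo "$result"
```

Appending `&` after `"${cmd[@]}"` would launch the command in the background in the same way the loops in the scripts do.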