Checking in code
anilkram committed Nov 10, 2024
1 parent 3bf0c69 commit 843d2b8
Showing 273 changed files with 69,134 additions and 2 deletions.
Binary file added AP_sampling/.README.md.swn
Binary file not shown.
51 changes: 51 additions & 0 deletions AP_sampling/README.md
@@ -0,0 +1,51 @@
# Explaining and Improving Contrastive Decoding by Extrapolating the Probabilities of a Huge and Hypothetical LM

<p align="center"><img src="https://github.com/amazon-science/llm-asymptotic-decoding/blob/master/AP_sampling/imgs/APD_first_figure.png?raw=true" width="1586" height="1402"></p>

## Introduction

To overcome the limitation of contrastive decoding (CD), we propose a new unsupervised decoding method called **A**symptotic **P**robability **D**ecoding (APD). APD explicitly extrapolates the probability curves from LMs of different sizes to infer the asymptotic probabilities of an infinitely large LM without incurring more inference cost than CD. In FactualityPrompts, an open-ended text generation benchmark, sampling with APD significantly boosts factuality compared with CD sampling and its variants, and achieves state-of-the-art results for Pythia 6.9B and OPT 6.7B. Furthermore, in five commonsense QA datasets, APD is often significantly better than CD and achieves an effect similar to using a larger LLM. For example, the perplexity of APD on top of Pythia 6.9B is even lower than the perplexity of Pythia 12B in CommonsenseQA and LAMBADA.
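
The extrapolation idea can be illustrated with a toy sketch. It is not the repository's implementation: the curve family `p(s) = p_inf + b / s`, the function names, and the example numbers are all assumptions made for illustration, whereas APD itself relies on the trained ALM' described later in this README. The sketch fits each candidate token's probability against the inverse model size and reads the intercept as the probability at infinite size.

```python
# Toy illustration only. The curve family p(s) = p_inf + b / s is an assumption
# of this sketch, not the parameterization used by APD or ALM'.


def extrapolate_asymptote(sizes, probs):
    """Least-squares line through (1/size, prob); the intercept estimates p at infinite size."""
    xs = [1.0 / s for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_p = sum(probs) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        # All models have the same size, so there is no trend to extrapolate.
        return mean_p
    slope = sum((x - mean_x) * (p - mean_p) for x, p in zip(xs, probs)) / var_x
    intercept = mean_p - slope * mean_x
    # A straight-line fit can leave [0, 1], so clamp to a valid probability.
    return min(max(intercept, 0.0), 1.0)


def asymptotic_distribution(sizes, probs_per_token):
    """Extrapolate every candidate token separately, then renormalize to sum to 1."""
    raw = {tok: extrapolate_asymptote(sizes, ps) for tok, ps in probs_per_token.items()}
    total = sum(raw.values())
    if total == 0.0:
        raise ValueError("all extrapolated probabilities are zero")
    return {tok: value / total for tok, value in raw.items()}


# Hypothetical next-token probabilities from three models (sizes in billions of parameters).
sizes = [1.0, 2.0, 4.0]
probs_per_token = {
    "Paris": [0.40, 0.50, 0.55],   # probability grows with model size
    "London": [0.60, 0.50, 0.45],  # probability shrinks with model size
}
print(asymptotic_distribution(sizes, probs_per_token))
```

In this made-up example the extrapolated distribution gives "Paris" more mass than even the largest model does, which is the intuition behind amplifying trends across model sizes. A real curve fit would need a more careful functional form and would operate on the full vocabulary.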


## Computational Environment

You can reproduce our Python environment using
```
conda create --name <env> --file requirement.txt
```
Most of the code can also be run with older versions of Hugging Face (e.g., the version in REAL_sampling/requirement.txt). The exception is the code that runs the Qwen LLM.

## How to run APD

To learn how to use APD and/or REAL sampling with Hugging Face, see the following example code:

```
./src/example_APD_REAL.py
```

### Run FactualityPrompts

To evaluate the generation results, first follow ../FactualityPrompt/README.md to download the data, update ../FactualityPrompt/src/const.py, and then run the following script.

If your machine has more than 7 GPUs, you can run the following file directly to generate the continuations.
```
./bin/continue_wiki_prompt_loop_eval.sh
```

### Run Question Answering Datasets

Step 1: Run the dataset download scripts in src/QA/dataset_preparation (for ARC, we concatenate the easy and challenge JSON outputs).

Step 2: Test APD models on the datasets. For datasets with only positive answers (e.g., LAMBADA, SQuAD, and MultiRC), use src/QA/dataset_preparation/test_squad_dataset.py. For datasets with negative answers (e.g., QASC, ARC, SocialIQA, and CommonsenseQA), use src/QA/dataset_preparation/test_neg_dataset.py. If you also want to test the APD on-the-fly baseline, use test_squad_dataset_online_all.py and test_neg_dataset_online_all.py instead. Remember to change the paths in each file accordingly.

Step 3: Run analyze_results.py or analyze_results_online_all.py to collect the results. For datasets that have negative answers and accuracy metrics, set have_acc to 1.
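
The README does not spell out how the perplexity and accuracy numbers are computed, so the following is only a sketch of one common convention, not a description of analyze_results.py. It assumes that each answer is scored by the perplexity of its tokens and that a question counts as correct when the positive answer has lower perplexity than every negative answer. All function names and the example log-probabilities are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one answer: exp of the negative mean natural-log token probability."""
    if not token_log_probs:
        raise ValueError("an answer needs at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def picks_positive(positive, negatives):
    """True when the positive answer has strictly lower perplexity than every negative."""
    pos_ppl = perplexity(positive)
    return all(pos_ppl < perplexity(neg) for neg in negatives)


def accuracy(examples):
    """Fraction of (positive, negatives) examples where the positive answer wins."""
    if not examples:
        raise ValueError("no examples to score")
    return sum(picks_positive(pos, negs) for pos, negs in examples) / len(examples)


# Hypothetical token log-probabilities for two questions.
examples = [
    # Positive answer is more likely than its single negative.
    ([math.log(0.5), math.log(0.5)], [[math.log(0.25), math.log(0.25)]]),
    # One of the two negatives is more likely than the positive answer.
    ([math.log(0.1)], [[math.log(0.2)], [math.log(0.05)]]),
]
print(accuracy(examples))
```

Under this convention, datasets with only positive answers can report perplexity alone, while datasets with negative answers can report both perplexity and accuracy.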


## How to Train ALM' (in order to use APD)

Put your text file into "data/raw/".

Change INPUT_FILE, data_folder_name, and OUTPUT_MODEL_FOLDER in bin/finetune_ALM.sh and run it (this assumes your machine has more than 7 GPUs).

Note that our current implementation first saves many probabilities and logits for the top tokens of various LLMs into a cache, which takes a lot of disk space.
Loading these probabilities also requires a lot of CPU memory. For example, after processing ~270M of Wikipedia text with 5 OPT models, we stored a 70G tensor and a 52G dataset cache, and our server has around 750G of CPU memory.
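
Because the cache is large, it may be worth checking the machine before starting. The following pre-flight sketch is not part of the repository: the function names are hypothetical, the memory query assumes a Linux-like system where `os.sysconf` exposes `SC_PHYS_PAGES`, and the thresholds are placeholders to set for your own data (the 70G tensor plus 52G dataset cache quoted above are only one data point).

```python
import os
import shutil

GIB = 1024 ** 3


def free_disk_gib(path):
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / GIB


def total_ram_gib():
    """Total physical memory in GiB (assumes a Linux-like os.sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB


def check_resources(path, need_disk_gib, need_ram_gib):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    disk = free_disk_gib(path)
    if disk < need_disk_gib:
        problems.append(f"only {disk:.0f} GiB free under {path!r}, need {need_disk_gib}")
    ram = total_ram_gib()
    if ram < need_ram_gib:
        problems.append(f"only {ram:.0f} GiB of RAM, need {need_ram_gib}")
    return problems


# Placeholder thresholds: 70 + 52 GiB of disk mirrors the example above;
# the RAM requirement is a guess and should be tuned to your dataset.
for problem in check_resources(".", need_disk_gib=70 + 52, need_ram_gib=128):
    print("WARNING:", problem)
```

Pointing `path` at the directory that will hold data/processed gives the most relevant disk number, since that is where the launcher scripts in this commit write their output folders.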
Binary file added AP_sampling/bin/.finetune_ALM.sh.swp
Binary file not shown.
51 changes: 51 additions & 0 deletions AP_sampling/bin/collect_top_prob.sh
@@ -0,0 +1,51 @@
#!/bin/bash
#top_k=10
bptt=1024

#data_folder_name="wiki2021_1e4_Pythia"
#data_folder_name="ROC_gen_1000_p095_Pythia"
#data_folder_name="news_gen_1000_p095_Pythia"
#data_folder_name="wp_gen_1000_p095_Pythia"
#data_folder_name="wiki2021_1e6_Pythia"
data_folder_name="wiki2021_5e6_Pythia"
#data_folder_name="ROC_spring_Pythia"
#data_folder_name="wikinews_Pythia"
#data_folder_name="wp_5000_Pythia"
#data_folder_name="wp_20000_Pythia"
#data_folder_name="wiki2021_1e5_Pythia"

#top_k="10"
#sampling_methods="10_20"

top_k="20,5,10"
sampling_methods="0_20,20_100,100_inf"
#top_k="20,20,20"
#sampling_methods="0_20,20_100,100_inf"

top_w_idx_model_name="EleutherAI/pythia-6.9b-deduped"
output_folder="data/processed/$data_folder_name/prob_tensor_${bptt}_ext2"
#output_folder="data/processed/$data_folder_name/prob_tensor_${bptt}_ext3"
#input_folder_name="../true_entropy/data/processed/$data_folder_name"
input_folder_name="data/processed/$data_folder_name"

declare -a bsz_arr=(2 4 4 8 12 16)
declare -a model_arr=("EleutherAI/pythia-2.8b-deduped" "EleutherAI/pythia-1.4b-deduped" "EleutherAI/pythia-1b-deduped" "EleutherAI/pythia-410m-deduped" "EleutherAI/pythia-160m-deduped" "EleutherAI/pythia-70m-deduped" )

model_name="EleutherAI/pythia-6.9b-deduped"
batch_size=1
cuda_init=0
echo "python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $cuda_init --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt"
python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $cuda_init --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt

pids=()

for i in "${!model_arr[@]}";
do
model_name=${model_arr[$i]}
batch_size=${bsz_arr[$i]}
echo "python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $i --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt"
python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $i --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt &
pids+=($!)
done
echo "${pids[@]}"

55 changes: 55 additions & 0 deletions AP_sampling/bin/collect_top_prob_Qwen_4b.sh
@@ -0,0 +1,55 @@
#!/bin/bash
#bptt=1024
bptt=128

#data_folder_name="ROC_gen_1000_p095_OPT"
#data_folder_name="news_gen_1000_p095_OPT"
#data_folder_name="wp_gen_1000_p095_OPT"
#data_folder_name="openwebtext_2017_18_1e5_OPT"
#data_folder_name="wiki2021_1e6_OPT"
data_folder_name="wiki2021_1e6_Qwen"
#data_folder_name="wiki2021_5e6_OPT"
#data_folder_name="ROC_spring_OPT"
#data_folder_name="wikinews_OPT"
#data_folder_name="wp_5000_OPT"
#data_folder_name="wp_20000_OPT"
#data_folder_name="wiki2021_1e5_OPT"

#top_k="10"
#sampling_methods="10_20"
top_k="20,5,10"
sampling_methods="0_20,20_100,100_inf"

#top_w_idx_model_name="EleutherAI/pythia-6.9b-deduped"
#top_w_idx_model_name="facebook/opt-6.7b"
top_w_idx_model_name="Qwen/Qwen1.5-4b"
#top_w_idx_model_name="Qwen/Qwen1.5-4b-Chat"
#output_folder="data/processed/$data_folder_name/prob_opt_tensor_$bptt"
output_folder="data/processed/$data_folder_name/prob_Qwen_4b_tensor_${bptt}_new"
#output_folder="data/processed/$data_folder_name/prob_Qwen_4b-Chat_tensor_${bptt}_new"
#input_folder_name="../true_entropy/data/processed/$data_folder_name"
input_folder_name="data/processed/$data_folder_name"

declare -a bsz_arr=(4 8)
declare -a model_arr=("Qwen/Qwen1.5-1.8b" "Qwen/Qwen1.5-0.5b" )
#declare -a model_arr=("Qwen/Qwen1.5-1.8b-Chat" "Qwen/Qwen1.5-0.5b-Chat" )

model_name="Qwen/Qwen1.5-4b"
#model_name="Qwen/Qwen1.5-4b-Chat"
batch_size=2
cuda_init=0
echo "python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $cuda_init --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt"
python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $cuda_init --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt

pids=()

for i in "${!model_arr[@]}";
do
model_name=${model_arr[$i]}
batch_size=${bsz_arr[$i]}
echo "python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $i --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt"
python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $i --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt &
pids+=($!)
done
echo "${pids[@]}"

48 changes: 48 additions & 0 deletions AP_sampling/bin/collect_top_prob_opt.sh
@@ -0,0 +1,48 @@
#!/bin/bash
bptt=1024

#data_folder_name="ROC_gen_1000_p095_OPT"
#data_folder_name="news_gen_1000_p095_OPT"
#data_folder_name="wp_gen_1000_p095_OPT"
#data_folder_name="openwebtext_2017_18_1e5_OPT"
#data_folder_name="wiki2021_1e6_OPT"
data_folder_name="wiki2021_5e6_OPT"
#data_folder_name="ROC_spring_OPT"
#data_folder_name="wikinews_OPT"
#data_folder_name="wp_5000_OPT"
#data_folder_name="wp_20000_OPT"
#data_folder_name="wiki2021_1e5_OPT"

#top_k="10"
#sampling_methods="10_20"
top_k="20,5,10"
sampling_methods="0_20,20_100,100_inf"

#top_w_idx_model_name="EleutherAI/pythia-6.9b-deduped"
top_w_idx_model_name="facebook/opt-6.7b"
#output_folder="data/processed/$data_folder_name/prob_opt_tensor_$bptt"
output_folder="data/processed/$data_folder_name/prob_opt_tensor_${bptt}_new"
#input_folder_name="../true_entropy/data/processed/$data_folder_name"
input_folder_name="data/processed/$data_folder_name"

declare -a bsz_arr=(2 4 8 16)
declare -a model_arr=("facebook/opt-2.7b" "facebook/opt-1.3b" "facebook/opt-350m" "facebook/opt-125m" )

model_name="facebook/opt-6.7b"
batch_size=1
cuda_init=0
echo "python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $cuda_init --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt"
python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $cuda_init --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt

pids=()

for i in "${!model_arr[@]}";
do
model_name=${model_arr[$i]}
batch_size=${bsz_arr[$i]}
echo "python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $i --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt"
python src/collect_top_prob.py --model_name=$model_name --top_w_idx_model_name=$top_w_idx_model_name --input_folder_name $input_folder_name --output_folder $output_folder --cuda_idx $i --batch_size $batch_size --top_k $top_k --sampling_methods $sampling_methods --bptt $bptt &
pids+=($!)
done
echo "${pids[@]}"
