Add code for evaluating pass @ k to inference_and_check #64

Closed
erictang000 wants to merge 34 commits


erictang000
Collaborator

Fixes a minor bug in perform_check and adds code for computing the pass @ k metric when n > 1 samples are generated per question.

For example, if we run the following command with a saved results file DeepSeek-R1-Distill-Qwen-7B_aime_train_None_False_0_-1.json containing n=128 samples per question:

python inference_and_check.py --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --task aime  --split train --max_tokens 32768   --inference --n 128  --temperatures 0.6 --tp 1 --check

We now get the following output:

Temperature: [0.6]
Loaded 30 existing results.
Found 3840 responses requiring reject sampling...
Processing Reject Sampling: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3840/3840 [00:02<00:00, 1432.24it/s]
Final reject-sampling accuracy: 2052/3840
Actual accuracy: 0.534375
Final pass @ k:
k: 128, pass @ k: 90.0
k: 64, pass @ k: 84.999
k: 32, pass @ k: 82.379
k: 16, pass @ k: 80.524
k: 8, pass @ k: 78.576
k: 4, pass @ k: 74.26
k: 2, pass @ k: 65.496
k: 1, pass @ k: 53.438
Temperature 0.6 acc: 27/30 (0.9)
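
The numbers above are consistent with the standard unbiased pass @ k estimator, 1 - C(n-c, k) / C(n, k), averaged over questions: k=1 matches the actual accuracy (2052/3840 = 0.534375, reported as 53.438), and k=128 matches 27/30 = 90.0. The diff itself is not shown here, so the following is only a minimal sketch under that assumption; the function names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not taken from inference_and_check.py.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question.

    n: total samples drawn for the question
    c: number of those samples that are correct
    k: number of samples in the hypothetical draw (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset contains no correct sample.
    # math.comb returns 0 when n - c < k, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over questions, as a percentage.

    correct_counts holds one entry per question: how many of its n
    samples were judged correct (an assumed input format).
    """
    total = sum(pass_at_k(n, c, k) for c in correct_counts)
    return 100.0 * total / len(correct_counts)


if __name__ == "__main__":
    # Toy data: 3 questions, 4 samples each (not the AIME results above).
    counts = [2, 4, 0]
    for k in (4, 2, 1):
        print(f"k: {k}, pass @ k: {round(mean_pass_at_k(counts, 4, k), 3)}")
```

With k=1 the estimator reduces to c/n, so its average equals the overall sample accuracy. With k=n it is 1 for any question with at least one correct sample and 0 otherwise, which is why the k=128 row equals the fraction of questions solved at least once.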

erictang000 marked this pull request as draft on February 4, 2025, 22:33
erictang000 closed this on February 4, 2025