
Improve memory usage by properly cleaning up weights as quantized #16

Merged
1 commit merged into main on Jun 13, 2024

Conversation

@mgoin (Member) commented on Jun 13, 2024

Before this PR, we would run out of memory quantizing the weights of Llama 3 70B on two H100 80GB GPUs.

This is because, while quantizing the weights, we were holding references to the original Linear modules, so PyTorch could not free their memory. Now, as we quantize each module, we explicitly clone its weights and biases and then delete all of the original Parameters. This seems to massively improve peak memory usage and essentially makes memory a non-issue beyond the initial unquantized checkpoint load.
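For illustration, a minimal sketch of this clone-then-delete pattern in PyTorch is shown below. This is not the PR's actual code: the helper names (`quantize_linear_`, `int8_quantize`) and the buffer names (`qweight`, `scales`) are assumptions, and the toy int8 quantizer only stands in for whatever quantization scheme the project applies.

```python
import gc
import torch
from torch import nn


def int8_quantize(w: torch.Tensor):
    # Toy symmetric per-tensor int8 quantization, purely illustrative.
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale


def quantize_linear_(module: nn.Linear, quantize_fn):
    # Clone the tensors so they are detached from the original Parameters.
    weight = module.weight.data.clone()
    bias = module.bias.data.clone() if module.bias is not None else None

    # Delete the original Parameters so no references keep the full-precision
    # tensors alive while the rest of the model is being quantized.
    del module.weight
    if bias is not None:
        del module.bias

    # Store the quantized replacement as buffers (names are hypothetical).
    qweight, scales = quantize_fn(weight)
    module.register_buffer("qweight", qweight)
    module.register_buffer("scales", scales)
    if bias is not None:
        module.register_buffer("bias", bias)

    # Drop the temporary clone and return cached blocks to the allocator.
    del weight
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


if __name__ == "__main__":
    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
    for m in model.modules():
        if isinstance(m, nn.Linear):
            quantize_linear_(m, int8_quantize)
```

The key point is the explicit `del` of each Parameter as its module is processed: without it, the module keeps the full-precision tensors referenced until the whole loop finishes, so peak memory grows with the original checkpoint plus the quantized copy.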

@mgoin merged commit 2e134d8 into main on Jun 13, 2024
4 checks passed
Development

Successfully merging this pull request may close these issues.

CUDA out of memory. Tried to allocate 462.00 MiB. GPU Quantization of Mixtral 8x22B