Improve memory usage by properly cleaning up weights as quantized #16

mgoin · 2024-06-13T16:23:59Z

Before this PR, we would run out of memory quantizing the weights of Llama 3 70B on two H100 80GB GPUs.

The cause was that, while quantizing, we kept references to the original Linear modules, so PyTorch could not free their memory. Now, as we quantize each module, we explicitly clone its weight and bias and then delete the original Parameters. This seems to massively reduce peak memory usage, and memory is essentially no longer an issue beyond the initial unquantized checkpoint load.
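For illustration, here is a minimal sketch of the clone-then-delete pattern described above. It is not the code from this PR: `QuantizedLinear`, `quantize_int8`, and `replace_linear_inplace` are hypothetical names, and the per-tensor int8 scheme is a toy stand-in for whatever quantization is actually applied. The point it tries to show is that only module names are kept while iterating, and that each original Parameter is deleted right after its module is quantized.

```python
import gc

import torch
import torch.nn as nn


class QuantizedLinear(nn.Module):
    """Hypothetical stand-in for a quantized layer (not the project's class)."""

    def __init__(self, qweight, scale, bias):
        super().__init__()
        self.register_buffer("qweight", qweight)
        self.register_buffer("scale", scale)
        self.register_buffer("bias", bias)  # may be None

    def forward(self, x):
        # Dequantize on the fly; a real kernel would avoid this.
        weight = self.qweight.to(x.dtype) * self.scale
        return nn.functional.linear(x, weight, self.bias)


def quantize_int8(weight):
    """Toy per-tensor symmetric int8 quantization, for illustration only."""
    scale = weight.abs().max().clamp(min=1e-8) / 127.0
    qweight = torch.round(weight / scale).clamp(-127, 127).to(torch.int8)
    return qweight, scale


def replace_linear_inplace(model):
    # Collect names only, so this list holds no references to the modules.
    names = [n for n, m in model.named_modules() if n and isinstance(m, nn.Linear)]
    for name in names:
        module = model.get_submodule(name)

        # Clone so the new layer shares no storage with the original Parameters.
        weight = module.weight.detach().clone()
        bias = None if module.bias is None else module.bias.detach().clone()
        qweight, scale = quantize_int8(weight)

        # Swap the quantized layer into the parent module.
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, QuantizedLinear(qweight, scale, bias))

        # Delete the original Parameters and drop every local reference.
        del module.weight, module.bias
        del module, weight
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand freed blocks back to the driver


# Usage on a small stand-in model
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
replace_linear_inplace(model)
```

With this shape, peak memory during the loop should stay close to the unquantized model plus one layer's clone, assuming nothing else (an optimizer, a cached module list, a state dict) still references the original tensors.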

Improve memory usage by properly cleaning up weights as quantized

9b8abad

This was linked to issues Jun 13, 2024

Quantization of Mixtral 8x22B #13

Closed

CUDA out of memory. Tried to allocate 462.00 MiB. GPU #15

Closed

mgoin mentioned this pull request Jun 13, 2024

CUDA out of memory. Tried to allocate 462.00 MiB. GPU #15

Closed

mgoin merged commit 2e134d8 into main Jun 13, 2024
4 checks passed
