Replies: 1 comment
The cache doesn't require lots of memory because of tensor copies; it requires lots of memory because it's a big list of tensors. Copying into it in place actually saves a large amount of memory and bandwidth compared to the HF approach, which concatenates the cache for every generated token, a much more expensive operation that also tends to cause memory fragmentation. The logic behind pre-allocating it instead of building it on the fly is pretty simple: at some point you will need the full-length cache anyway, or if not, you can just pre-allocate a smaller cache. In any case it's better to decide up front what batch size and sequence length you have the VRAM to support.

As for running without the cache, there's no support for that currently. The speed would drop by a factor of a hundred or more, so I haven't seen a use case for it in causal LM. I guess maybe sequence classification or something? If you want to try it, you could run the forward pass with a dummy cache (allocated with a sequence length of 1 or so), disable fused attention, and then comment out these lines in the regular attention function.
Also, the generator logic would have to change so that instead of running just the last token of the sequence through the forward pass, you run the entire sequence every time.
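To make the trade-off concrete, here is a minimal PyTorch sketch of the ideas above. The tensor names, shapes, and the `model` call are made up for illustration and are not taken from the actual ExLlama code; the sketch only shows why an in-place write into a pre-allocated cache is cheaper than growing the cache by concatenation, and what a cache-less generation step would have to do.

```python
import torch

# Illustrative sizes only -- not the real model dimensions.
batch_size, num_heads, head_dim, max_seq_len = 1, 32, 128, 2048
device, dtype = "cuda", torch.float16

# Pre-allocated cache: one fixed-size tensor per layer, sized up front for the
# longest sequence you intend to support. Memory use stays constant for the run.
k_cache = torch.zeros(batch_size, num_heads, max_seq_len, head_dim,
                      device=device, dtype=dtype)

def cache_write_inplace(k_cache, new_k, pos):
    # Copy the new token's keys into their pre-allocated slot. No new
    # allocation, no fragmentation; the cost is one narrow copy per step.
    k_cache[:, :, pos:pos + new_k.shape[2], :].copy_(new_k)

def cache_grow_by_concat(old_k, new_k):
    # HF-style growth: every generated token allocates a new, slightly larger
    # tensor and copies the entire existing cache into it.
    return torch.cat([old_k, new_k], dim=2)

def generate_step_without_cache(model, token_ids):
    # Without any cache, each step must re-run the whole prefix through the
    # model (assumed here to return (batch, seq, vocab) logits), so generation
    # cost grows quadratically with sequence length.
    logits = model(token_ids)          # full sequence, every step
    return logits[:, -1, :].argmax(dim=-1)
```

With the pre-allocated cache, only the newest token needs to go through the forward pass each step; without it, the whole sequence goes through every time, which is where the hundred-fold or greater slowdown comes from.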
-
I am a bit confused about how exactly the cache works. I did see the discussion here, though:
#155
It requires a lot of GPU memory due to the tensor copies, especially with longer sequence lengths and larger batches.
Is it somehow possible to disable it?