Replies: 1 comment
The cache doesn't require lots of memory because of tensor copies; it requires lots of memory because it's a big list of tensors. Copying into it in place actually saves a large amount of memory and bandwidth compared to the HF approach, which concatenates the cache for every generated token, a much more expensive operation that also tends to cause memory fragmentation. The logic behind pre-allocating it instead of building it on the fly is pretty simple: at some point you will need the full-length cache anyway, or if not, you can just pre-allocate a smaller cache. In any case it's better to decide up front what batch size and sequence length you have the VRAM to support.

As for running without the cache, there's no support for that currently. The speed would drop by a factor of a hundred or more, so I haven't seen a use case for it in causal LM. I guess maybe sequence classification or something? If you want to try it, you could run the forward pass with a dummy cache (allocated with a sequence length of 1 or so), disable fused attention, and then comment out these lines in the regular attention function.
Also, the generator logic would have to change so that instead of running just the last token of the sequence through the forward pass, you run the entire sequence every time.
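To make the trade-off concrete, here is a minimal PyTorch sketch of the ideas above. The tensor names, shapes, and the `model` call are made up for illustration and are not taken from the actual ExLlama code; the sketch only shows why an in-place write into a pre-allocated cache is cheaper than growing the cache by concatenation, and what a cache-less generation step would have to do.

```python
import torch

# Illustrative sizes only -- not the real model dimensions.
batch_size, num_heads, head_dim, max_seq_len = 1, 32, 128, 2048
device, dtype = "cuda", torch.float16

# Pre-allocated cache: one fixed-size tensor per layer, sized up front for the
# longest sequence you intend to support. Memory use stays constant for the run.
k_cache = torch.zeros(batch_size, num_heads, max_seq_len, head_dim,
                      device=device, dtype=dtype)

def cache_write_inplace(k_cache, new_k, pos):
    # Copy the new token's keys into their pre-allocated slot. No new
    # allocation, no fragmentation; the cost is one narrow copy per step.
    k_cache[:, :, pos:pos + new_k.shape[2], :].copy_(new_k)

def cache_grow_by_concat(old_k, new_k):
    # HF-style growth: every generated token allocates a new, slightly larger
    # tensor and copies the entire existing cache into it.
    return torch.cat([old_k, new_k], dim=2)

def generate_step_without_cache(model, token_ids):
    # Without any cache, each step must re-run the whole prefix through the
    # model (assumed here to return (batch, seq, vocab) logits), so generation
    # cost grows quadratically with sequence length.
    logits = model(token_ids)          # full sequence, every step
    return logits[:, -1, :].argmax(dim=-1)
```

With the pre-allocated cache, only the newest token needs to go through the forward pass each step; without it, the whole sequence goes through every time, which is where the hundred-fold or greater slowdown comes from.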
-
I am a bit confused about how exactly the cache works. I did see the discussion here, though:
#155
It requires a lot of GPU memory due to the tensor copies, especially with longer sequence lengths and larger batches.
Is it somehow possible to disable it?