-
Yes, I'd love to see more results. Although, scores at this point are really only useful for guiding optimization, and they'll become irrelevant very quickly. E.g. I just discovered that torch.matmul is considerably slower than calling cuBLAS directly (weird, since Torch supposedly uses cuBLAS under the hood), so that will have some implications for the next commit. After that I'm looking at optimizing vector-matrix multiplications, since it turns out that even a simple, naive kernel can outperform cuBLAS in that specific case, which happens to be the typical case during autoregression. I'm also chasing a CPU bottleneck right now, of all things. So it's probably premature to start compiling a big table of scores, unless I had a way to automate the benchmarks with every new revision. Results are still very welcome, of course.
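
For anyone curious what "a simple, naive kernel" means for the vector-matrix case, here's a rough sketch of the idea (my own illustration, not the project's actual kernel; names and launch config are placeholders): one thread per output row, a plain dot-product loop over the columns. During autoregression the activation is a single token vector, so the operation is memory-bound and even this can keep pace with a general GEMM routine.

```cuda
#include <cuda_fp16.h>

// Hypothetical naive matrix-vector kernel (illustration only): computes
// y = W x for a row-major fp16 weight matrix W of shape [rows, cols],
// one thread per output row.
__global__ void naive_gemv(const half* __restrict__ W,
                           const half* __restrict__ x,
                           half* __restrict__ y,
                           int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;                                // accumulate in fp32
    const half* w_row = W + (size_t)row * cols;      // start of this output's weight row
    for (int col = 0; col < cols; ++col)
        acc += __half2float(w_row[col]) * __half2float(x[col]);

    y[row] = __float2half(acc);
}

// Launch sketch: d_W, d_x, d_y are device pointers allocated elsewhere.
// naive_gemv<<<(rows + 255) / 256, 256>>>(d_W, d_x, d_y, rows, cols);
```
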
-
Agreed, this project is AWESOME! Thank you @turboderp. It would be nice to be able to donate, or to talk on a Discord or something. Based on what you've singlehandedly accomplished in so little time, I'd choose this project over lit-llama any day.
-
I don't know if I like the idea of donations. I would end up feeling guilty and stressed, probably. I guess in appreciation of Llama you could click a Facebook ad or something. ;) It could be fun to start a collection to pay for some server time, though. A big part of the motivation for this project is context length, and I have some preliminary results suggesting Llama can be finetuned to work on long sequences. So just setting up a large training job and letting it run for however many dollars people are willing to chip in... but then I have too much on my plate already. Maybe later. I am on Discord, but I hardly ever use it. Are there any good servers for people working on/with local LLMs?
-
I'm planning to build a home server with two 4090s, but I'm still hesitant because no currently available software supports a 65B model with its full context length. Also, while about 10 t/s is not bad, getting substantially more would be very nice. So far I'm really impressed by your work @turboderp (and by the other contributors!), and I'm looking forward to what this project is going to achieve with respect to memory footprint and speed.
-
Let's get some benchmarks going :-) I'll run it on the 3090 this week and see what it does. Happy to see some more scores.