-
Yes, I'd love to see more results. Although, scores at this point are really only useful for guiding optimization, and they'll become irrelevant very quickly. E.g. I just discovered that torch.matmul is considerably slower than calling cuBLAS directly (weird, since Torch supposedly uses cuBLAS under the hood), so that will have some implications for the next commit. After that I'm looking at optimizing vector-matrix multiplications, since it turns out that even a simple, naive kernel can outperform cuBLAS in that specific case, which happens to be the typical case during autoregression. I'm also chasing a CPU bottleneck right now, of all things. So it's probably premature to start compiling a big table of scores, unless I had a way to automate the benchmarks with every new revision. Results are still very welcome, of course.
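
For anyone curious what "a simple, naive kernel" means for the vector-matrix case, here's a rough sketch of the idea (my own illustration, not the project's actual kernel; names and launch config are placeholders): one thread per output row, a plain dot-product loop over the columns. During autoregression the activation is a single token vector, so the operation is memory-bound and even this can keep pace with a general GEMM routine.

```cuda
#include <cuda_fp16.h>

// Hypothetical naive matrix-vector kernel (illustration only): computes
// y = W x for a row-major fp16 weight matrix W of shape [rows, cols],
// one thread per output row.
__global__ void naive_gemv(const half* __restrict__ W,
                           const half* __restrict__ x,
                           half* __restrict__ y,
                           int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;                                // accumulate in fp32
    const half* w_row = W + (size_t)row * cols;      // start of this output's weight row
    for (int col = 0; col < cols; ++col)
        acc += __half2float(w_row[col]) * __half2float(x[col]);

    y[row] = __float2half(acc);
}

// Launch sketch: d_W, d_x, d_y are device pointers allocated elsewhere.
// naive_gemv<<<(rows + 255) / 256, 256>>>(d_W, d_x, d_y, rows, cols);
```
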
-
Agreed, this project is AWESOME! Thank you @turboderp. It would be nice to be able to donate, or to talk on a Discord or something. Based on what you've singlehandedly accomplished in so little time, I'd choose this project over lit-llama any day.
-
I don't know if I like the idea of donations. I would end up feeling guilty and stressed, probably. I guess in appreciation of Llama you could click a Facebook ad or something. ;) It could be fun to start a collection to pay for some server time, though. A big part of the motivation for this project is context length, and I have some preliminary results suggesting Llama can be finetuned to work on long sequences. So just setting up a large training job and letting it run for however many dollars people are willing to chip in... but then I have too much on my plate already. Maybe later. I am on Discord, but I hardly ever use it. Are there any good servers for people working on/with local LLMs?
-
I'm planning to build a home server with two 4090s, but I'm still hesitant because no currently available software supports a 65B model with its full context length. Also, while about 10 t/s is not bad, getting substantially more would be very nice. So far I'm really impressed by your work @turboderp (and by the other contributors!), and I'm looking forward to what this project is going to achieve with respect to memory footprint and speed.
-
Let's get some benchmarks going :-) I'll run it on the 3090 this week and see what it does. Happy to see some more scores.