added triangular matrix multiplication kernel #214

Merged Apr 22, 2024 (2 commits)

@ngc92 (Contributor) commented Apr 22, 2024

Companion to #213. This adds a file dedicated to developing the triangular matmul kernel.
It also shows the intermediate kernels written on the way towards an efficient version.
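
As a point of reference, here is a minimal CPU sketch of what a triangular matmul could compute. It assumes the product is `out = a * b^T` with both inputs stored row-major as `n x d`, and that only the lower triangle (`col <= row`) is needed; neither the layout nor the function names (`trimul_cpu_reference`, `trimul_num_outputs`) come from this PR, they are illustrative only.

```c
#include <stddef.h>

// Hypothetical CPU reference, NOT the kernel from this PR.
// Assumption: out = a * b^T, with a and b both row-major (n x d),
// and only the entries with col <= row are computed.
void trimul_cpu_reference(float *out, const float *a, const float *b,
                          size_t n, size_t d) {
    for (size_t row = 0; row < n; row++) {
        // Stop at the diagonal: the upper triangle is never written.
        for (size_t col = 0; col <= row; col++) {
            float acc = 0.0f;
            for (size_t k = 0; k < d; k++) {
                acc += a[row * d + k] * b[col * d + k];
            }
            out[row * n + col] = acc;
        }
    }
}

// Number of output entries that are actually computed: n * (n + 1) / 2,
// which approaches half of the n * n entries of a full matmul as n grows.
size_t trimul_num_outputs(size_t n) {
    return n * (n + 1) / 2;
}
```

A reference like this is mainly useful for checking a GPU kernel's output on small inputs; entries above the diagonal keep whatever value `out` held before the call.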

To give a break from all the maths and indexing in the code, the development of these kernels is told as a story.
Some of the metaphors are stretched quite a bit, so feel free to make adjustments, but I hope that overall this is easier to follow than just "indexing with this formula to achieve coalesced access".

Currently, the reads in the inner loop still cause 2-way bank conflicts, so there is room for improvement.
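
To make "2-way bank conflict" concrete, here is a small host-side sketch that counts conflicts for a given access pattern. It assumes the usual NVIDIA shared-memory layout of 32 banks of 4-byte words and a warp of 32 threads; the strided pattern is only an illustration and is not the access pattern of the kernel in this PR, and both function names are made up for this example.

```c
#include <stddef.h>

#define WARP_SIZE 32
#define NUM_BANKS 32  // assumption: 32 banks, each 4 bytes (one float) wide

// Returns the largest number of distinct 32-bit words that the threads of one
// warp request from a single bank (1 = conflict-free, 2 = 2-way conflict).
// word_index[t] is the shared-memory word (float) index read by thread t.
int bank_conflict_degree(const size_t word_index[WARP_SIZE]) {
    int worst = 0;
    for (int bank = 0; bank < NUM_BANKS; bank++) {
        int distinct = 0;
        for (int t = 0; t < WARP_SIZE; t++) {
            if (word_index[t] % NUM_BANKS != (size_t)bank) continue;
            // Threads reading the same word are served by one broadcast,
            // so only count a word the first time it appears.
            int seen = 0;
            for (int u = 0; u < t; u++) {
                if (word_index[u] == word_index[t]) { seen = 1; break; }
            }
            if (!seen) distinct++;
        }
        if (distinct > worst) worst = distinct;
    }
    return worst;
}

// Illustrative pattern: thread t reads word t * stride.
void strided_pattern(size_t word_index[WARP_SIZE], size_t stride) {
    for (int t = 0; t < WARP_SIZE; t++) {
        word_index[t] = (size_t)t * stride;
    }
}
```

With this model, a stride of 1 is conflict-free, while a stride of 2 maps two different words onto each even-numbered bank, which is one way a 2-way conflict can arise.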

Timings on my machine:

1.40 ms for this kernel vs 2.37 ms for cuBLAS

Given that we're doing only half the work, that still leaves us roughly 15-20% less efficient than cuBLAS.
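
A rough check of that estimate, under the idealized assumption that a kernel as efficient as cuBLAS would need exactly half of cuBLAS's time for half the work (an assumption, not a measurement):

$$
t_{\text{ideal}} = \frac{2.37\ \text{ms}}{2} \approx 1.19\ \text{ms},
\qquad
\frac{t_{\text{ideal}}}{t_{\text{measured}}} = \frac{1.185}{1.40} \approx 0.85,
\qquad
\frac{t_{\text{measured}}}{t_{\text{ideal}}} = \frac{1.40}{1.185} \approx 1.18
$$

So the kernel reaches about 85% of that ideal, or equivalently takes about 18% longer than it.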

@karpathy (Owner)

Wow, you really had a lot of fun with the TriMatlon 😂 😂 😂
The most incredible fusion of art and engineering I've seen yet :D

@karpathy merged commit 7830cf6 into karpathy:master on Apr 22, 2024
3 checks passed
@ngc92 deleted the trimul branch on April 28, 2024 08:46