From e12eb6c4ee89d3c6efd58be0a9dec2fd24ac37a3 Mon Sep 17 00:00:00 2001
From: glyh <lyhokia@gmail.com>
Date: Sun, 2 Feb 2025 11:49:07 +0800
Subject: [PATCH 1/2] fix typo: attention layer should use masked_softmax
 instead of softmax

---
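Note for reviewers: the sketch below illustrates why the distinction matters. It is a minimal pure-Python version of a causal `masked_softmax`, not the project's implementation. It assumes the mask boundary for query row `i` is `total_seq_len - seq_len + i + 1`, meaning each new token sees every cached position plus itself; that boundary is an assumption of this note, not something stated in the README.

```python
import math


def masked_softmax(score):
    """Row-wise softmax with a causal mask over a (seq_len, total_seq_len) matrix.

    Assumption: the last seq_len key positions line up with the query rows,
    so query row i may attend to columns 0 .. total_seq_len - seq_len + i.
    """
    seq_len = len(score)
    total_seq_len = len(score[0])
    out = []
    for i, row in enumerate(score):
        visible = total_seq_len - seq_len + i + 1  # number of unmasked columns
        row_max = max(row[:visible])  # subtract the max for numerical stability
        exps = [math.exp(x - row_max) for x in row[:visible]]
        denom = sum(exps)
        # Masked (future) positions get a weight of exactly zero.
        out.append([e / denom for e in exps] + [0.0] * (total_seq_len - visible))
    return out


# Two new tokens attending over three total positions, all scores equal.
attn = masked_softmax([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
```

With equal scores, a plain softmax would give every position weight 1/3 in both rows; here the first row can only see two positions, so its last weight is zero.
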
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index d1ce12c..d802a52 100644
--- a/README.md
+++ b/README.md
@@ -135,7 +135,7 @@ residual = output + residual
 
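 Second, we need to multiply a (seq_len, dim) matrix by a (dim, total_seq_len) matrix to get the shape we want, but neither Q nor K currently satisfies this. You have a few options: (1) reshape and transpose the matrices (which implies copying), then use a matrix multiplication that supports broadcasting (because the heads must be matched up correctly); or (2) treat the matrices as collections of vectors and manually index and multiply them according to the correct correspondence. I recommend the second approach, which is easier to understand.
 
-Likewise, you will run into the same situation when multiplying the weight matrix by V after applying softmax.
+Likewise, you will run into the same situation when multiplying the weight matrix by V after applying masked_softmax.
 
 For each head, the full computation of the Self-Attention layer is as follows: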
 
@@ -148,7 +148,7 @@ K = cat(K_cache, K)
 V = cat(V_cache, V)
 ### The part below is what you need to implement
 score = Q @ K.T / sqrt(dim)
-attn = softmax(score)
+attn = masked_softmax(score)
 attn_V = attn @ V
 out = attn_V @ O_weight.T
 residual = out + residual

From 821c02495133b6b827305c3819717c6eb7b08227 Mon Sep 17 00:00:00 2001
From: glyh <lyhokia@gmail.com>
Date: Sun, 2 Feb 2025 15:16:21 +0800
Subject: [PATCH 2/2] clarify self-attention implementation

---
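Note for reviewers: the sketch below shows how the pseudocode splits across the two places named in this patch. It is a single-head, pure-Python illustration under stated assumptions, not the project's code: `out_proj_add_residual` is a hypothetical name for the step at "down_proj matmul and add residual", the causal boundary `total_seq_len - seq_len + i + 1` is assumed, and the per-head indexing the README discusses is left out.

```python
import math


def self_attention(q, k, v):
    """Single-head attention: q is (seq_len, dim); k and v are (total_seq_len, dim)."""
    seq_len, total_seq_len, dim = len(q), len(k), len(q[0])
    attn_v = []
    for i in range(seq_len):
        visible = total_seq_len - seq_len + i + 1  # assumed causal boundary
        # score = Q @ K.T / sqrt(dim), computed only for visible keys
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
            for j in range(visible)
        ]
        # attn = masked_softmax(score)
        row_max = max(scores)
        exps = [math.exp(s - row_max) for s in scores]
        denom = sum(exps)
        attn = [e / denom for e in exps]
        # attn_V = attn @ V
        attn_v.append(
            [sum(attn[j] * v[j][d] for j in range(visible)) for d in range(len(v[0]))]
        )
    return attn_v


def out_proj_add_residual(attn_v, o_weight, residual):
    """out = attn_V @ O_weight.T, then residual = out + residual (returns the sum)."""
    result = []
    for row, res_row in zip(attn_v, residual):
        out = [sum(a * w for a, w in zip(row, w_row)) for w_row in o_weight]
        result.append([o + r for o, r in zip(out, res_row)])
    return result


# One new token attending over one cached position plus itself.
attn_v = self_attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]])
new_residual = out_proj_add_residual(attn_v, [[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]])
```

Keeping the output projection and residual add out of `self_attention` mirrors the boundary this patch documents: the function stops at `attn_V`.
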
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d802a52..0e8c015 100644
--- a/README.md
+++ b/README.md
@@ -146,10 +146,11 @@ K = RoPE(x @ K_weight.T)
 V = x @ V_weight.T
 K = cat(K_cache, K)
 V = cat(V_cache, V)
-### The part below is what you need to implement
+### The part below is what you need to implement in the function self_attention
 score = Q @ K.T / sqrt(dim)
 attn = masked_softmax(score)
 attn_V = attn @ V
+### The part below is what you need to implement at "down_proj matmul and add residual"
 out = attn_V @ O_weight.T
 residual = out + residual
 ```