docs: 📝 Added theoretical derivation for REINFORCE with baseline
Phoenix-Shen committed Aug 16, 2022
1 parent cfcb363 commit f4f695e
Showing 3 changed files with 54 additions and 8 deletions.
9 changes: 5 additions & 4 deletions ActorCritic(AC)/readme.md
@@ -1,11 +1,12 @@
Actor-Critic Algorithm
Combines the policy gradient and function approximation methods<br>
# Actor-Critic Algorithm

Combines the policy gradient and function approximation methods

---

## **In short: replace vt, previously a fixed value, with the output of a neural network**

```
```python
action=Actor(observation)
score=Critic(action)

@@ -21,7 +22,7 @@ Actor-Critic involves two neural networks, and every update happens in continuous state

1. Use the state value
2. Use the state-action value
3. Based on the TD error (the method used in this code): `td_error = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)` (see the sketch after this list)
4. Based on the advantage function: Advantage = state-action value − state value
5. Based on the TD(λ) error
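
A minimal sketch of option 3 (TD error), assuming a hypothetical `critic` module that maps a state tensor to its value estimate; it only illustrates the formula above and is not the exact code in this repo.

```python
import torch

def td_error(critic, s_t, r_t1, s_t1, gamma=0.99):
    """delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)."""
    with torch.no_grad():
        v_next = critic(s_t1)  # V(s_{t+1}), treated as a fixed target
    v_curr = critic(s_t)       # V(s_t)
    return r_t1 + gamma * v_next - v_curr
```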

5 changes: 2 additions & 3 deletions PolicyGradient(PG)/readme.md
@@ -5,7 +5,6 @@
---

- Unlike Q-learning, it directly outputs an action rather than the value of each action, but it still takes the environment observation as input
<br>

- The policy π is a network with parameters θ; it takes the environment state as input and outputs a probability distribution over actions (a minimal sampling sketch follows below)
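
A minimal sketch of this idea, assuming a discrete action space and a hypothetical `policy_net` module that maps a state to action logits; the sampled action and its log-probability are exactly what the policy-gradient loss later needs.

```python
import torch
from torch.distributions import Categorical

def select_action(policy_net, state):
    logits = policy_net(torch.as_tensor(state, dtype=torch.float32))
    dist = Categorical(logits=logits)  # pi(a | s; theta), a distribution over actions
    action = dist.sample()             # sample an action instead of taking an argmax
    return action.item(), dist.log_prob(action)  # log pi(a_t | s_t; theta)
```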

@@ -31,7 +30,7 @@

---

```
```python
class PolicyGradient(nn.Module):
def __init__(self) -> None:
super().__init__()
@@ -65,7 +64,7 @@ I wrote one in pytorch, but gradient descent makes no progress and I do not know why,

## About log probability

![](./log.png)
![log_prob](./log.png)

## Updating the network parameters

48 changes: 47 additions & 1 deletion readme.md
@@ -679,7 +679,53 @@ Policy Gradient **algorithm details**
Since $Q_{\pi}(s_t,a_t) = \mathbb{E}[U_t \vert s_t,a_t]$, we can use $u_t$ to approximate $Q_{\pi}(s_t,a_t)$. The obvious drawback of this method is that we can only update after playing out a whole episode, which is inefficient.
2. Use a neural network to approximate $Q_{\pi}$

This is the ActorCritic Methods described below
This is the [ActorCritic Methods](#2-actor-critic-on-policy) described below

**REINFORCE with Baseline**

- The baseline is introduced in [Policy-Based Learning](#3-策略学习-policy-based-learning---学习策略-pias); a short check that it leaves the gradient unbiased appears after this list


- We have the stochastic policy gradient:
$$
\mathbf{g}(a_t) = \frac{\partial \ln \pi(a_t \vert s_t;\theta) }{\partial \theta} \cdot \left(Q_\pi (s_t,a_t)-V_\pi(s_t) \right)
$$

- Since $Q_\pi(s_t,a_t) = \mathbb{E}[U_t \vert s_t,a_t]$, we can make the Monte Carlo approximation $Q_\pi(s_t,a_t) \approx u_t$; what remains is to compute $u_t$

- How do we get $u_t$? We play one episode, observe the trajectory $(s_t,a_t,r_t,s_{t+1},a_{t+1},r_{t+1},\dots,s_n,a_n,r_n)$, and then compute the return $u_t = \sum_{i=t}^{n}\gamma^{i-t} \cdot r_i$; moreover, $u_t$ is an unbiased estimate of $Q_\pi(s_t,a_t)$

- We still need $V_{\pi}(s_t)$; we approximate it with a neural network $v(s_t;\mathbf{w})$, so the policy gradient can be approximated as:
$$
\frac{\partial V_{\pi}(s_t)}{\partial \theta} \approx \mathbf{g}(a_t) \approx \frac{\partial \ln \pi(a_t \vert s_t;\theta) }{\partial \theta} \cdot \left(u_t - v(s_t;\mathbf{w}) \right)
$$

- In summary, we made three approximations:
$$
\frac{\partial V_{\pi}(s_t)}{\partial \theta} = \mathbb{E}_ {A \sim \pi(\cdot \vert s_t; \theta)}\left[\frac{\partial \ln \pi(A \vert s_t;\theta) }{\partial \theta} \cdot \left(Q_{\pi}(s_t,A) -V_\pi(s_t)\right)\right]
$$
We sample the action $a_t \sim \pi(\cdot \vert s_t;\theta)$ and use it in place of the expectation; this is the first approximation.
$$
\mathbf{g}(a_t) = \frac{\partial \ln \pi(a_t \vert s_t;\theta) }{\partial \theta} \cdot \left(Q_{\pi}(s_t,a_t) -V_\pi(s_t)\right)
$$
Then we use $u_t$ and $v(s_t;\mathbf{w})$ to approximate $Q_{\pi}(s_t,a_t)$ and $V_\pi(s_t)$; these are the second and third approximations:
$$
\mathbf{g}(a_t) \approx \frac{\partial \ln \pi(a_t \vert s_t;\theta) }{\partial \theta} \cdot \left(u_t - v(s_t;\mathbf{w}) \right)
$$

- This gives us two networks: the policy network $\pi(a \vert s;\theta)$ and the value network $v(s;\mathbf{w})$; as before, they can share the parameters of the feature layers.

- Algorithm steps (a minimal PyTorch sketch of these steps follows after the list)

1. Play one episode and observe the trajectory $(s_t,a_t,r_t,s_{t+1},a_{t+1},r_{t+1},\dots,s_n,a_n,r_n)$
2. Compute the return $u_t = \sum_{i=t}^{n}\gamma^{i-t} \cdot r_i$ and the error $\delta_t = v(s_t;\mathbf{w}) - u_t$
3. Update the parameters $\theta$ and $\mathbf{w}$:
$$
\begin{aligned}
\theta &\gets \theta - \beta \cdot \delta_t \cdot \frac{\partial \ln \pi(a_t \vert s_t;\theta) }{\partial \theta}\\
\mathbf{w} &\gets \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial v(s_t;\mathbf{w})}{\partial \mathbf{w}}
\end{aligned}
$$
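
As a quick check on why the baseline is legitimate (the standard argument, added here for completeness): subtracting $V_\pi(s_t)$, which does not depend on the action, leaves the expected gradient unchanged, because

$$
\begin{aligned}
\mathbb{E}_ {A \sim \pi(\cdot \vert s_t;\theta)}\left[\frac{\partial \ln \pi(A \vert s_t;\theta)}{\partial \theta} \cdot V_\pi(s_t)\right]
&= V_\pi(s_t) \cdot \sum_{a}\pi(a \vert s_t;\theta)\,\frac{\partial \ln \pi(a \vert s_t;\theta)}{\partial \theta}\\
&= V_\pi(s_t) \cdot \sum_{a}\frac{\partial \pi(a \vert s_t;\theta)}{\partial \theta}
= V_\pi(s_t) \cdot \frac{\partial}{\partial \theta}\sum_{a}\pi(a \vert s_t;\theta) = 0
\end{aligned}
$$

since $\sum_{a}\pi(a \vert s_t;\theta) = 1$ and the derivative of a constant is zero.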

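A minimal PyTorch sketch of these three steps, under the following assumptions: `log_probs` holds the per-step $\ln \pi(a_t \vert s_t;\theta)$ tensors collected during the episode, `values` holds the corresponding $v(s_t;\mathbf{w})$ outputs of the value network, `rewards` is the list of rewards, and `policy_opt` / `value_opt` are optimizers over the two networks. The names and structure are illustrative, not the exact code in this repo.

```python
import torch

def reinforce_with_baseline_update(log_probs, values, rewards,
                                   policy_opt, value_opt, gamma=0.99):
    # Step 2a: returns u_t = sum_{i=t}^{n} gamma^(i-t) * r_i, computed backwards
    returns, u = [], 0.0
    for r in reversed(rewards):
        u = r + gamma * u
        returns.insert(0, u)
    returns = torch.tensor(returns)

    log_probs = torch.stack(log_probs)        # log pi(a_t | s_t; theta)
    values = torch.stack(values).reshape(-1)  # v(s_t; w)

    # Step 2b: delta_t = v(s_t; w) - u_t
    delta = values - returns

    # Step 3: losses whose gradients reproduce the two update rules above;
    # delta is detached in the policy loss so it acts as a constant weight.
    policy_loss = (delta.detach() * log_probs).sum()
    value_loss = 0.5 * (delta ** 2).sum()

    policy_opt.zero_grad()
    value_opt.zero_grad()
    (policy_loss + value_loss).backward()
    policy_opt.step()
    value_opt.step()
```

The step sizes $\beta$ and $\alpha$ from step 3 live in the two optimizers, so one gradient-descent step on these losses is exactly the pair of updates written above.
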
### 2. Actor Critic On-Policy

