GROOT: Learning to Follow Instructions by Watching Gameplay Videos
======================================================================

.. admonition:: Quick Facts

    - GROOT is an open-world controller that follows open-ended instructions by using reference videos as expressive goal specifications, eliminating the need for text annotations.
    - GROOT leverages a causal transformer-based encoder-decoder architecture to self-supervise the learning of a structured goal space from gameplay videos.

Insights
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To develop an effective instruction-following controller, defining a robust goal representation is essential. Unlike previous approaches that use language descriptions or future images (e.g., Steve-1) as goals, GROOT leverages reference videos as goal representations. These gameplay videos serve as a rich and expressive source of information, enabling the agent to learn complex behaviors effectively. The paper frames the learning process as future state prediction, allowing the agent to follow demonstrations seamlessly.

Method
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Formally, the future state prediction problem is defined as maximizing the log-likelihood of future states given past ones: :math:`\log p_{\theta}(s_{t+1:T} | s_{0:t})`. By introducing a latent variable :math:`g` conditioned on the past states, the evidence lower bound (ELBO) can be expressed as:

.. math::

    \log p_{\theta}(s_{t+1:T} | s_{0:t}) &= \log \sum_g p_{\theta}(s_{t+1:T}, g | s_{0:t}) \\
    &\geq \mathbb{E}_{g \sim q_\phi(\cdot | s_{0:T})} \left[ \log p_{\theta}(s_{t+1:T} | g, s_{0:t}) \right] - D_{\text{KL}}(q_\phi(g | s_{0:T}) || p_\theta(g | s_{0:t})),

where :math:`D_{\text{KL}}` is the Kullback-Leibler divergence, and :math:`q_\phi` represents the variational posterior.

This objective can be further simplified using the transition function :math:`p_{\theta}(s_{t+1} | s_{0:t}, a_t)` and a goal-conditioned policy (to be learned) :math:`\pi_{\theta}(a_t | s_{0:t}, g)`:

.. math::

    \log p_{\theta}(s_{t+1:T} | s_{0:t}) \geq \sum_{\tau = t}^{T - 1} \mathbb{E}_{g \sim q_\phi(\cdot | s_{0:T}),\, a_\tau \sim p_{\theta}(\cdot | s_{0:\tau+1})} \left[ \log \pi_{\theta}(a_{\tau} | s_{0:\tau}, g) \right] - D_{\text{KL}}(q_\phi(g | s_{0:T}) || p_\theta(g | s_{0:t})),

where :math:`q_\phi(\cdot | s_{0:T})` is implemented as a video encoder, and :math:`p_{\theta}(\cdot | s_{0:\tau+1})` is the inverse dynamics model (IDM), which infers the action taken between consecutive states and is typically a pretrained model.
Please refer to the `paper <https://arxiv.org/pdf/2310.08235>`_ for more details.
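
To make the objective above concrete, the following is a minimal PyTorch sketch of how the two ELBO terms could be combined into a training loss: a behavior-cloning term over IDM-labeled actions and a KL term between the posterior (video encoder) and prior goal distributions. The function name, the ``kl_weight`` coefficient, the discrete action space, and the diagonal-Gaussian parameterization of both goal distributions are illustrative assumptions, not details taken from the paper or GROOT's released code.

.. code-block:: python

    import torch.nn.functional as F

    def groot_training_loss(policy_logits, idm_actions,
                            posterior_mu, posterior_logvar,
                            prior_mu, prior_logvar, kl_weight=0.01):
        """Sketch of the ELBO-style loss: behavior cloning plus KL regularization.

        policy_logits: (B, T, A) action logits from the goal-conditioned policy.
        idm_actions:   (B, T)    long-tensor action labels from the pretrained IDM.
        posterior_*/prior_*: (B, D) diagonal-Gaussian parameters of
            q_phi(g | s_{0:T}) and p_theta(g | s_{0:t}) (an assumed parameterization).
        """
        # First term: maximize log pi_theta(a_tau | s_{0:tau}, g), with a_tau labeled by the IDM.
        bc_loss = F.cross_entropy(policy_logits.flatten(0, 1), idm_actions.flatten(0, 1))

        # Second term: KL(q_phi || p_theta) between two diagonal Gaussians, averaged over the batch.
        kl = 0.5 * (
            prior_logvar - posterior_logvar
            + (posterior_logvar.exp() + (posterior_mu - prior_mu) ** 2) / prior_logvar.exp()
            - 1.0
        ).sum(dim=-1).mean()

        return bc_loss + kl_weight * kl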

Architecture
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. figure:: ../_static/image/groot_architecture.png
    :width: 800
    :align: center

    GROOT agent architecture.

The GROOT agent consists of a video encoder and a policy decoder. The video encoder is a non-causal transformer that extracts semantic information and generates goal embeddings. The policy is a causal transformer decoder that receives the goal embeddings as the instruction and autoregressively translates the state sequence into a sequence of actions.
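
As a rough illustration of this layout, the sketch below wires together standard PyTorch transformer modules: a bidirectional (non-causal) encoder that turns reference-video features into goal embeddings, and a causally masked decoder whose cross-attention reads those embeddings while translating the state sequence into action logits. All module names and sizes, and the assumption of pre-extracted per-frame features, are illustrative simplifications; GROOT's actual implementation (visual backbone, goal-space sampling, action head) differs.

.. code-block:: python

    import torch
    import torch.nn as nn

    class GrootStyleAgent(nn.Module):
        """Schematic encoder-decoder: a non-causal video encoder produces goal
        embeddings, and a causal policy decoder maps states to action logits."""

        def __init__(self, frame_dim=1024, embed_dim=512, num_actions=128):
            super().__init__()
            self.proj = nn.Linear(frame_dim, embed_dim)
            # Non-causal (bidirectional) transformer over the reference-video frames.
            self.video_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True),
                num_layers=4,
            )
            # Causal decoder: self-attention over past states is masked, and
            # cross-attention reads the goal embeddings as the instruction.
            self.policy_decoder = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True),
                num_layers=4,
            )
            self.action_head = nn.Linear(embed_dim, num_actions)

        def forward(self, reference_frames, state_frames):
            # Both inputs are (B, T, frame_dim) pre-extracted per-frame visual features.
            goal_embeddings = self.video_encoder(self.proj(reference_frames))
            states = self.proj(state_frames)
            t = states.size(1)
            # Upper-triangular -inf mask so each state token only attends to the past.
            causal_mask = torch.triu(
                torch.full((t, t), float("-inf"), device=states.device), diagonal=1
            )
            hidden = self.policy_decoder(states, goal_embeddings, tgt_mask=causal_mask)
            return self.action_head(hidden)  # (B, T, num_actions) action logits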