
Commit

General writing style changes and error fixes.
AechPro committed Jan 7, 2025
1 parent be28518 commit cf36c2f
Showing 15 changed files with 155 additions and 147 deletions.
30 changes: 15 additions & 15 deletions docs/Cheatsheets/game_values.md
@@ -5,17 +5,17 @@ sidebar_position: 2

# Game Values

All lengths are measured in Unreal Units (uu), equivalent to 1 centimeter.
This document provides a reference for constant values used in Rocket League. All spatial measurements are in Unreal Units (uu), where 1 uu = 1 centimeter.

## Field Dimensions
| Name | Type | Range | Description |
|------|------|-------|-------------|
| `SIDE_WALL_X` | float | ±4096 | Distance from center to side wall |
| `BACK_WALL_Y` | float | ±5120 | Distance from center to back wall |
| `CEILING_Z` | float | 2044 | Height of the ceiling |
| `BACK_NET_Y` | float | ±6000 | Distance from center to back of net |
| `SIDE_WALL_X` | float | ±4096 | Distance from field center to side wall |
| `BACK_WALL_Y` | float | ±5120 | Distance from field center to back wall |
| `CEILING_Z` | float | 2044 | Height of arena ceiling from ground |
| `BACK_NET_Y` | float | ±6000 | Distance from field center to back of net |
| `CORNER_CATHETUS_LENGTH` | float | 1152 | Length of corner wall section |
| `RAMP_HEIGHT` | float | 256 | Height of corner ramp |
| `RAMP_HEIGHT` | float | 256 | Height of corner ramp from ground |
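To see how these dimensions fit together, here is a minimal sketch that checks whether a position lies inside the arena's axis-aligned bounds. The constants are copied from the table above; the `in_arena_bounds` helper is purely illustrative, not part of RLGym, and it ignores the corner and ramp geometry.

```python
# Illustrative only: constants copied from the table above. Corner walls and
# ramps are ignored, so this is just the arena's axis-aligned bounding box.
SIDE_WALL_X = 4096.0
BACK_WALL_Y = 5120.0
CEILING_Z = 2044.0

def in_arena_bounds(x: float, y: float, z: float) -> bool:
    """Rough check that a position (in uu) is inside the field's bounding box."""
    return abs(x) <= SIDE_WALL_X and abs(y) <= BACK_WALL_Y and 0.0 <= z <= CEILING_Z

print(in_arena_bounds(0.0, 0.0, 100.0))     # True
print(in_arena_bounds(0.0, 5500.0, 100.0))  # False: behind the back wall
```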

## Goal Dimensions
| Name | Type | Range | Description |
@@ -26,11 +26,11 @@ All lengths are measured in Unreal Units (uu), equivalent to 1 centimeter.
## Time Values
| Name | Type | Range | Description |
|------|------|-------|-------------|
| `TICKS_PER_SECOND` | int | 120 | Physics updates per second |
| `SMALL_PAD_RECHARGE` | float | 4.0 | Small boost pad respawn time |
| `BIG_PAD_RECHARGE` | float | 10.0 | Large boost pad respawn time |
| `DEMO_RESPAWN_SECONDS` | float | 3.0 | Time to respawn after demo |
| `BOOST_CONSUMPTION_RATE` | float | 33.3 | Boost used per second |
| `TICKS_PER_SECOND` | int | 120 | Physics simulation rate |
| `SMALL_PAD_RECHARGE` | float | 4.0 | Small boost pad respawn time (seconds) |
| `BIG_PAD_RECHARGE` | float | 10.0 | Large boost pad respawn time (seconds) |
| `DEMO_RESPAWN_SECONDS` | float | 3.0 | Car respawn time after demolition |
| `BOOST_CONSUMPTION_RATE` | float | 33.3 | Boost consumption per second |
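A couple of useful quantities follow directly from this table; the snippet below is illustrative arithmetic only, not an RLGym API.

```python
# Illustrative arithmetic using the values in the table above.
TICKS_PER_SECOND = 120
BOOST_CONSUMPTION_RATE = 33.3  # boost units consumed per second of boosting

tick_duration = 1.0 / TICKS_PER_SECOND               # ~0.0083 s per physics tick
full_tank_duration = 100.0 / BOOST_CONSUMPTION_RATE  # ~3.0 s from a full (100) boost tank

print(f"{tick_duration:.4f} s per tick, ~{full_tank_duration:.1f} s of continuous boost")
```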

## Physics Values
| Name | Type | Range | Description |
@@ -44,10 +44,10 @@ All lengths are measured in Unreal Units (uu), equivalent to 1 centimeter.
## Speed Limits
| Name | Type | Range | Description |
|------|------|-------|-------------|
| `BALL_MAX_SPEED` | float | 6000 | Maximum ball speed |
| `CAR_MAX_SPEED` | float | 2300 | Maximum car speed |
| `SUPERSONIC_THRESHOLD` | float | 2200 | Speed for supersonic trail |
| `CAR_MAX_ANG_VEL` | float | 5.5 | Max angular velocity (rad/s) |
| `BALL_MAX_SPEED` | float | 6000 | Maximum attainable ball velocity |
| `CAR_MAX_SPEED` | float | 2300 | Maximum attainable car velocity |
| `SUPERSONIC_THRESHOLD` | float | 2200 | Velocity threshold for supersonic state |
| `CAR_MAX_ANG_VEL` | float | 5.5 | Maximum car angular velocity (radians/s) |
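A common use of these limits is detecting the supersonic state or normalizing velocities for observations; the sketch below is a hypothetical helper built from the table values, not an RLGym function.

```python
import math

# Values copied from the table above; `is_supersonic` is illustrative only.
CAR_MAX_SPEED = 2300.0
SUPERSONIC_THRESHOLD = 2200.0

def is_supersonic(vx: float, vy: float, vz: float) -> bool:
    """True when the car's linear speed meets or exceeds the supersonic threshold."""
    return math.sqrt(vx * vx + vy * vy + vz * vz) >= SUPERSONIC_THRESHOLD

# Dividing velocity components by CAR_MAX_SPEED keeps observation values near [-1, 1].
print(is_supersonic(2250.0, 0.0, 0.0))  # True
```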

## Teams
| Name | Type | Range | Description |
41 changes: 23 additions & 18 deletions docs/Cheatsheets/reinforcement_learning_terms.md
@@ -4,42 +4,47 @@
---

# Reinforcement Learning Background

What follows is a series of definitions that may be useful for understanding the reinforcement learning concepts used when training an agent in an RLGym environment. Note that these definitions are not meant to be exhaustive, and we will formulate the reinforcement learning setting in a somewhat non-standard way to better align with the environments typically considered by practitioners using RLGym.

## The Basics

A *decision process* $\mathcal{P}$, sometimes called an *environment*, is defined by a set of *states* $\mathcal{S}$, a set of *actions* $\mathcal{A}$, and a state transition probability function $\mathcal{T}(s' | s, a)$.
## The Basics

An *action* $a \in \mathcal{A}$ is a decision made by an agent in a given state $s \in \mathcal{S}$.
A *decision process* $\mathcal{P}$, sometimes called an *environment*, is characterized by a set of *states* $\mathcal{S}$, a set of *actions* $\mathcal{A}$, and a state transition probability function $\mathcal{T}(s' | s, a)$.

Upon receiving an action $a$ at a state $s$, the decision process continues by sampling a new state $s'$ from the state transition probability function $\mathcal{T}(s' | s, a)$.
When an agent executes an *action* $a \in \mathcal{A}$ in state $s \in \mathcal{S}$, the environment transitions to a new state $s'$ according to $\mathcal{T}(s' | s, a)$.

For our purposes it is useful to further consider a set of *observations* $\mathcal{O}$, which are representations of states that an agent acts upon. The observation function $\mathbf{O} : \mathcal{S} \rightarrow \mathcal{O}$ maps states $s$ to observations $o$.

A policy $\pi : \mathbb{R}^{|\mathcal{O}|} \rightarrow \mathbb{R}^{|\mathcal{A}|}$ is a function that maps an observation $o$ to a real-valued vector, $\pi(o) \in \mathbb{R}^{|\mathcal{A}|}$.
A *policy* $\pi : \mathbb{R}^{|\mathcal{O}|} \rightarrow \mathbb{R}^{|\mathcal{A}|}$ is a function that maps an observation $o$ to a real-valued vector, $\pi(o) \in \mathbb{R}^{|\mathcal{A}|}$.

An *action function* $\mathbf{I} : \mathbb{R}^{|\mathcal{A}|} \rightarrow \mathcal{A}$ is a function that maps the output of a policy to an action.

An action function $\mathbf{I} : \mathbb{R}^{|\mathcal{A}|} \rightarrow \mathcal{A}$ is a function that maps the output of a policy to an action.
The complete state-to-action mapping is given by $a = \mathbf{I}(\pi(\mathbf{O}(s)))$. We will refer to this mapping as an *agent*.

An agent is a function that maps a state $s$ to an action $a$ according to a policy $\pi$. That is, an agent is the combination of an observation function $\mathbf{O}$, a policy $\pi$, and an action function $a = \mathbf{I}(\pi(\mathbf{O}(s)))$.
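To make this composition concrete, the toy sketch below wires an observation function, a policy, and an action function together as $a = \mathbf{I}(\pi(\mathbf{O}(s)))$. All three functions here are made-up placeholders, not RLGym components.

```python
import random

# Toy stand-ins for O, pi, and I; all three are illustrative placeholders.
def observation_fn(state: dict) -> list:
    # O: build an observation vector from the raw state.
    return [state["x"], state["y"]]

def policy(obs: list) -> list:
    # pi: map an observation to one real value per available action (3 here).
    return [random.random() for _ in range(3)]

def action_fn(policy_output: list) -> int:
    # I: map the policy output to a concrete action (here, the argmax index).
    return max(range(len(policy_output)), key=policy_output.__getitem__)

state = {"x": 0.5, "y": -1.0}
action = action_fn(policy(observation_fn(state)))  # a = I(pi(O(s)))
print(action)
```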
Following each action $a$ in state $s$, a *reward function* $\mathbf{R} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ generates a scalar reward $r$.

After receiving an action $a$ at a state $s$, but before a new state $s'$ is sampled, a reward $r$ is given by a *reward function* $\mathbf{R} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$.
We refer to a single interaction between the agent and the environment as a *timestep*, which contains $(s, a, r, s')$.

An *episode* is a sequence of states, actions, and rewards that an agent follows from a given initial state $s_0$ to a terminal state $s_T$.
## Trajectories and Returns

A *trajectory* is a generalization of an episode as a sequence of states, actions, and rewards that begin with some state $s_t$ and end at some other state $s_{t+n}$. Note that trajectories do not need to begin with an initial state $s_0$, and do not need to end with a terminal state $s_T$. They can also be of infinite length.
We are concerned with two types of sequences:
- An *episode*: A complete sequence of timesteps starting from an initial state $s_0$ and ending with a terminal state $s_T$
- A *trajectory*: Any sequence of timesteps from some arbitrary state $s_t$ to $s_{t+n}$

If all sequences of actions from any $s_t$ are guaranteed to eventually reach some terminal state $s_T$, we refer to this as a *finite-horizon problem*. If instead we allow the trajectory to continue indefinitely, we refer to this as an *infinite-horizon problem*.

A *return* $G$ is the sum of rewards obtained by an agent over a trajectory. In the undiscounted episodic setting, a return is defined by $G_t = \sum_{t=1}^{t=T} r_t$. However, in the more general setting there may be no terminal state $s_T$.
A *return* $G$ is the cumulative reward obtained over a trajectory. In the finite-horizon case, the return is defined as $G_t = \sum_{k=t}^{T} r_k$. For infinite-horizon problems, we must introduce a *discount factor* $\gamma \in [0, 1)$ to form the *discounted return* $G_t = \sum_{k=t}^{\infty} \gamma^{k-t} r_k$.

As such, we typically consider *discounted returns* $G_t = \sum_{t=1}^{t=inf} \gamma^{t-1} r_t$, where $\gamma \in [0, 1)$ is the *discount factor*.
The discount factor $\gamma$ serves two purposes. First, it ensures that the infinite sum defining $G_t$ converges whenever $|r_t|$ is bounded, since $\sum_{k=0}^{\infty} \gamma^k$ is a convergent geometric series. Second, it acts as a form of temporal *credit assignment*, assigning more weight to rewards obtained closer to time $t$.

The discount factor $\gamma$ works to ensure that $G_t$ converges in the limit as $t \rightarrow \infty$. This is because $\sum_{t=0}^{\infty} \gamma^t$ is a converging geometric series, so if the maximum value of $|r_t|$ is finite, then $G_t$ is also finite. Further, $\gamma$ works as a form of *credit assignment*, where actions that are taken closer to time $t$ have a greater impact on the return $G_t$ (this is because $\gamma^t$ gets smaller as $t$ gets larger). Another way to think of this is that actions taken farther into the future are less important than the actions we just took.
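As a small worked example, the snippet below computes a discounted return for a short reward sequence; it is plain illustrative arithmetic, not part of any RLGym API.

```python
# Discounted return of a short trajectory: G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_return(rewards: list, gamma: float = 0.99) -> float:
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# A reward received a few steps in the future is worth slightly less than one received now.
print(discounted_return([1.0, 0.0, 0.0, 10.0]))  # 1.0 + 0.99**3 * 10 ≈ 10.70
```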
## Value Functions

The *state value function*, often just called the *value function* $V : \mathcal{S} \rightarrow \mathbb{R}$ is a function that maps states to the *expected return* of a policy at that state. It is defined by $V(s_t) = \mathbb{E}_{\pi}[G_t | s_t]$. This is an important quantity to understand because it captures the *quality* of a policy at a given state. It should be emphasized that the value function considers only one specific policy, so every time we make even a tiny change to our agent's policy, the value function will change as well. One way to envision the value function is to imagine the agent being dropped into the game at some arbitrary state $s$. The *value* of the policy at that state is the *expected return* that the agent will get if they play the game according to that policy forever, starting at that state and stopping if a terminal state is ever reached.
The *state value function*, often just called the *value function*, $V : \mathcal{S} \rightarrow \mathbb{R}$, maps each state to the *expected return* of a policy from that state. It is given by $V(s_t) = \mathbb{E}_{\pi}[G_t | s_t]$. This is an important quantity because it captures the *quality* of a policy at a given state. It should be emphasized that the value function considers only one specific policy, so every time we make even a tiny change to our agent's policy, the value function changes as well. One way to envision the value function is to imagine the agent being dropped into the game at some arbitrary state $s$: the *value* of the policy at that state is the return it would get *on average* if we allowed it to play from that state infinitely many times, restarting from the same state each time a terminal state is reached.

The *state-action value function*, or *Q function* $Q : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is a function that maps states and actions to the *expected return* of a policy at a state $s$ when the agent takes action $a$ at that state, then acts according to $pi$ forever afterwards. It is defined by $Q(s_t, a_t) = \mathbb{E}_{\pi}[G_t | s_t, a_t]$. This quantity is similar to $V(s)$, but with the caveat that the agent must first take the action $a_t$ at state $s_t$ before acting according to the policy $\pi$ forever afterwards. Note that $Q(s, a)$ can be written in terms of $V(s)$ as $Q(s, a) = V(s) + r + \gamma V(s')$.
The *state-action value function*, or *Q function*, $Q : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, maps a state-action pair to the *expected return* of a policy when the agent takes action $a$ in state $s$ and acts according to $\pi$ forever afterwards. It is given by $Q(s_t, a_t) = \mathbb{E}_{\pi}[G_t | s_t, a_t]$. This quantity is similar to $V(s)$, with the caveat that the agent must first take the action $a_t$ at state $s_t$ before following the policy $\pi$. Note that $Q(s, a)$ can be written in terms of $V$ as $Q(s, a) = \mathbb{E}[r + \gamma V(s')]$, where the expectation is over the next state $s' \sim \mathcal{T}(\cdot | s, a)$.

A third useful quantity is the *state-action advantage function*, or *advantage function* $A : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, is the difference between the Q function and the value function at a timestep. This is given by $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. Think of the advantage function as a measure of how much better it was to take the action $a_t$ at state $s_t$ than it would have been to just act according to the policy $\pi$ at that state.
The *state-action advantage function*, or *advantage function* $A : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, is the difference between the Q function and the value function at a state given an action. This is given by $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. Think of the advantage function as a measure of how much better it was to take the action $a_t$ at state $s_t$ than it would have been to just act according to the policy $\pi$ at that state.
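In practice these functions are estimated from sampled data; one common estimate is the one-step TD error $r + \gamma V(s') - V(s)$, which stands in for the advantage. The sketch below assumes the value estimates are already available from some learned value function and is not tied to any particular library.

```python
# One-step advantage estimate via the TD error. The value estimates passed in
# are assumed to come from an already-trained value function; illustrative only.
def one_step_advantage(reward: float, value_s: float, value_s_next: float,
                       gamma: float = 0.99, terminal: bool = False) -> float:
    bootstrap = 0.0 if terminal else gamma * value_s_next
    return reward + bootstrap - value_s

print(one_step_advantage(reward=1.0, value_s=5.0, value_s_next=4.5))  # 0.455
```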

## How Agents Learn
## Learning Process

Most learning algorithms consider an *objective function* $J(\pi)$, which is a function that maps a policy $\pi$ to a real number. The goal of learning is then to find a policy $\pi^*$ that maximizes the objective function, i.e. $J(\pi^*) = \max_{\pi} J(\pi)$. A convenient choice for $J$ would be any of the Q function, value function, or advantage function. For our purposes, we will focus on the advantage function, because the Proximal Policy Optimization (PPO) algorithm uses it as the objective.
Most learning algorithms consider an *objective function* $J(\pi)$, which maps a policy $\pi$ to a real number. The goal of learning is then to find a policy $\pi^*$ that maximizes the objective function, i.e. $J(\pi^*) = \max_{\pi} J(\pi)$. A convenient choice for $J$ would be any of the Q function, value function, or advantage function. For our purposes, we will focus on the advantage function, because the Proximal Policy Optimization (PPO) algorithm uses it as its objective.
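For reference, PPO optimizes a clipped surrogate of the advantage-weighted probability ratio between the new and old policies. The NumPy sketch below shows that standard clipped loss term; it is a generic illustration, not code taken from any RLGym learning package.

```python
import numpy as np

def ppo_clip_loss(log_probs_new: np.ndarray, log_probs_old: np.ndarray,
                  advantages: np.ndarray, clip_eps: float = 0.2) -> float:
    """Negative clipped surrogate objective (a loss to minimize)."""
    ratio = np.exp(log_probs_new - log_probs_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return float(-np.mean(surrogate))

# Dummy example: two sampled actions with their log-probabilities and advantages.
loss = ppo_clip_loss(np.array([-0.9, -1.1]), np.array([-1.0, -1.0]),
                     np.array([0.5, -0.2]))
print(loss)
```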
11 changes: 6 additions & 5 deletions docs/Custom Environments/custom-environment.md
@@ -5,11 +5,11 @@ sidebar_position: 2

# Creating an Environment

This tutorial will show you how to create a simple grid world environment using the RLGym API. Every RLGym environment must implement the configuration objects specified in our [RLGym overview](/Getting%20Started/overview). Let's take a look at the following example to see how each of the configuration objects can be implemented and used.
This tutorial demonstrates how to implement a grid world environment using the RLGym API. Each RLGym environment requires implementing the configuration objects described in the [RLGym overview](/Getting%20Started/overview). The following example illustrates an implementation of each required component.

## Grid World Example

We'll start with the transition engine, which defines the states environment dynamics.
We begin by defining the state of our environment and a transition engine that handles its dynamics.

```python
from typing import Dict, List, Tuple, Optional
@@ -89,7 +89,8 @@ class GridWorldEngine(TransitionEngine[int, GridWorldState, int]):
self._state = initial_state if initial_state is not None else self.create_base_state()
```

Now that the hard part is out of the way, we can implement the other configuration objects that will shape what agents see, the actions they can take, and how they should be rewarded.
Now we implement the remaining configuration objects for our environment.

```python
# We need to define a state mutator, which is responsible for modifying the environment state.
class GridWorldMutator(StateMutator[GridWorldState]):
@@ -197,7 +198,7 @@ class GridWorldTruncatedCondition(DoneCondition[int, GridWorldState]):
return state.steps >= self.max_steps

```
And that's it for our configuration objects! To use our new gridworld environment, we just pass the configuration objects to the RLGym constructor.
With all configuration objects implemented, we can construct the environment by passing an instance of each object to the RLGym constructor.

```python
# Build the environment
@@ -225,4 +226,4 @@ for _ in range(1000):
print(f"Episode reward: {ep_rew}")
ep_rew = 0
```
Now we are ready to plug this into a learning algorithm and train a gridworld agent!
The environment is now ready for integration with a learning algorithm to train a grid world agent.
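As one illustration of what that integration might look like, below is a minimal tabular Q-learning update. It assumes hashable observations and four discrete actions, and is a generic sketch rather than the training setup used by any RLGym learning package.

```python
import random
from collections import defaultdict

# Generic tabular Q-learning pieces; assumes hashable observations and four
# discrete actions (e.g. up/down/left/right in the grid world above). Illustrative only.
NUM_ACTIONS = 4
q_table = defaultdict(lambda: [0.0] * NUM_ACTIONS)
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def select_action(obs) -> int:
    # Epsilon-greedy over the tabulated Q-values for this observation.
    if random.random() < epsilon:
        return random.randrange(NUM_ACTIONS)
    values = q_table[obs]
    return values.index(max(values))

def q_update(obs, action: int, reward: float, next_obs, done: bool) -> None:
    # One-step Q-learning target: r + gamma * max_a' Q(s', a').
    target = reward if done else reward + gamma * max(q_table[next_obs])
    q_table[obs][action] += alpha * (target - q_table[obs][action])
```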
15 changes: 6 additions & 9 deletions docs/Getting Started/introduction.md
@@ -7,22 +7,16 @@

## What is RLGym?

RLGym is a Python API for creating reinforcement learning environments. While it was originally designed for the game [Rocket League](https://www.rocketleague.com), the core API is now game-agnostic. This means you can use RLGym to create any kind of environment you want, from simple grid worlds to complex physics simulations. Get an overview of the API in our [overview](/Getting%20Started/overview) section.

## How it Works
RLGym provides a simple API for creating fully customizable environments for reinforcement learning projects. Each environment is built from a few core components, which we refer to as "configuration objects". When provided with a set of configuration objects, RLGym will handle the flow of information throughout the environment, and provide a simple interface for learning agents to interact with.
RLGym is a Python API for creating reinforcement learning environments. While it was originally designed for the game [Rocket League](https://www.rocketleague.com), the core API is now game-agnostic. This means you can use RLGym to create environments from simple grid worlds to complex physics simulations. Check out the [overview](/Getting%20Started/overview) section for a detailed look at the RLGym API.

## Getting Started

The most developed use of RLGym is for Rocket League. We provide a complete environment implementation that allows users to train agents with [RocketSim](https://github.com/ZealanL/rocketsim), a headless simulator for Rocket League. Users can customize every aspect of the environment by implementing their own [Configuration Objects](/Getting%20Started/overview/), or use the default implementations provided by RLGym. Head over to our [Quickstart Guide](quickstart.md) if you want to jump right in to training a Rocket League agent, or check out our [Custom Environments](../../Custom%20Environments/custom-environment) section for a step-by-step guide to creating your own environment with the RLGym API.
RLGym provides an implementation for Rocket League that uses [RocketSim](https://github.com/ZealanL/rocketsim) as a headless simulator. You can use the default settings or customize the environment by implementing your own [configuration objects](/Getting%20Started/overview/). Take a look at the [Quickstart Guide](quickstart.md) to train your first Rocket League agent, or check out the [Custom Environments](../../Custom%20Environments/custom-environment) section to see an example of how to create your own environment.

## Installation
RLGym is split into several packages to keep things modular and lightweight. The core API package has no dependencies, while additional packages provide specific functionality:
RLGym is split into sub-packages to keep things simple. The core API has no dependencies, and you can add extra features through these sub-packages:

```bash
# Installs every rlgym component
pip install rlgym[all]

# Installs only the api
pip install rlgym

@@ -34,4 +28,7 @@ pip install rlgym[rl-sim]

# Installs RLViser and RocketSim rocket league packages
pip install rlgym[rl-rlviser]

# Installs every rlgym component
pip install rlgym[all]
```
