A pedagogical Proximal Policy Optimization (PPO) project applied to the classic Snake game.
This repository demonstrates how a PPO-based agent learns to play Snake by collecting food and avoiding collisions. Over time, the agent discovers a strategy of quickly circling around the reward before taking it, which reduces self-collisions since the agent does not precisely observe its entire tail position.
- The environment is implemented in `Game/Snake.py`.
- The PPO agent logic is in `Agent/PPO.py`.
- Usage examples and the training procedure are shown in `main.ipynb`.
- Clone this repository.
- Install the dependencies: `pip install -r requirements.txt`
- You can then run or modify `main.ipynb` to train or test the PPO agent.
PPO is a policy gradient method designed to stabilize training by limiting how far each update can move the policy. In this project:

- We use clipping (controlled by `clip_epsilon`) to avoid overly large policy updates.
- We incorporate entropy regularization (controlled by `entropy_coef`) to encourage exploration.
- We apply Generalized Advantage Estimation (GAE) for more stable advantage computation (a sketch of these pieces follows below).
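The sketch below illustrates the clipped surrogate loss with an entropy bonus and the GAE computation in plain NumPy. It is a simplified, hypothetical version for exposition; the actual implementation, function names, and hyperparameter defaults live in `Agent/PPO.py`.

```python
# Illustrative sketch only -- not the repository's code; see Agent/PPO.py.
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    `values` has one extra entry: the bootstrap value of the final state.
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages

def clipped_policy_loss(new_log_probs, old_log_probs, advantages,
                        entropy, clip_epsilon=0.2, entropy_coef=0.01):
    """PPO clipped surrogate objective with an entropy bonus (to be minimized)."""
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    return -np.mean(np.minimum(unclipped, clipped)) - entropy_coef * np.mean(entropy)
```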
The core training code is in the `train` function of `PPOAgent`, and the environment loop lives in `SnakeEnv`.
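At a high level, each epoch collects a batch of transitions from `SnakeEnv` and then runs the PPO update on that batch. The outline below is only a rough, hypothetical sketch of that structure (the method names `agent.act`, `env.step`, `env.reset`, and `agent.update` are assumptions, not the repository's exact API); the authoritative version is `PPOAgent.train` in `Agent/PPO.py`.

```python
# Hypothetical outline of one training epoch -- method names are assumptions;
# see PPOAgent.train in Agent/PPO.py for the actual logic.
def run_epoch(env, agent, steps_per_epoch=4096):
    batch = []                                       # (state, action, reward, done, log_prob, value)
    state = env.reset()
    for _ in range(steps_per_epoch):
        action, log_prob, value = agent.act(state)   # sample an action from the current policy
        next_state, reward, done = env.step(action)  # advance the Snake environment one step
        batch.append((state, action, reward, done, log_prob, value))
        state = env.reset() if done else next_state
    agent.update(batch)                              # GAE + clipped policy/value updates on the batch
```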
Below is a demo of the trained agent playing, occasionally circling around the reward to avoid self-collisions (it does not know its tail's exact position; see `main.ipynb` for the observation space details):
The agent’s reward curve increases as it masters collecting food. Episode length first rises as the agent survives longer, then eventually decreases as it sacrifices longevity for quicker gains:
- Train the agent by running:

  ```python
  # Inside main.ipynb
  agent.train(total_epochs=10000, steps_per_epoch=4096)
  ```

- Test the trained agent (with optional rendering):

  ```python
  agent.test_episode(render=True)
  ```
Explore `main.ipynb` for more details on the experiments, and review the in-code comments for a deeper understanding.