🎮 Play-All-ToyText with Q-Learning

Welcome to the Play-All-ToyText with Q-Learning project! 🚀
In this project, I've applied the Q-Learning algorithm to solve problems in popular ToyText environments like FrozenLake, CliffWalking, Blackjack, and Taxi.

The goal is to train agents using Q-Learning to optimize policies and maximize rewards in these environments.


🔍 Introduction

In these ToyText environments, agents learn to choose actions that maximize rewards in games such as:

  • FrozenLake 🌊
  • CliffWalking 🧗‍♂️
  • BlackJack 🃏
  • Taxi 🚕

The objective of this project is to apply the Q-Learning algorithm to optimize agents' policies and achieve the highest possible rewards.


Q-learning Update Rule

The Q-learning update for the Q-table is expressed as:

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] $$


Explanation of Terms

  • $Q(s, a)$: The current Q-value for taking action $a$ in state $s$.
  • $\alpha$ (learning rate): A scalar between $0$ and $1$ that controls how much new information influences the update.
  • $r$ (reward): The immediate reward received after taking action $a$ in state $s$.
  • $\gamma$ (discount factor): A scalar between $0$ and $1$ that determines the importance of future rewards; a higher $\gamma$ emphasizes long-term rewards.
  • $\max_{a'} Q(s', a')$: The maximum Q-value in the next state $s'$ across all possible actions $a'$.
  • $(s, a)$: The current state-action pair.
  • $(s', a')$: The next state and the actions available in it.
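
To make the notation concrete, here is a minimal NumPy sketch of a single Q-table update. The table shape, hyperparameter values, and the sample transition are illustrative assumptions, not the exact settings used in this repository.

```python
import numpy as np

# Hypothetical Q-table: 16 states x 4 actions (e.g. a 4x4 FrozenLake map).
q_table = np.zeros((16, 4))

alpha, gamma = 0.1, 0.99                            # example learning rate and discount factor
state, action, reward, next_state = 0, 1, 0.0, 4    # one hypothetical transition (s, a, r, s')

# TD target, TD error, and the Q-learning update from the rule above.
td_target = reward + gamma * np.max(q_table[next_state])
td_error = td_target - q_table[state, action]
q_table[state, action] += alpha * td_error
```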

Key Concepts

  1. Temporal Difference (TD) Error: The difference between the TD target and the current Q-value:

    $\text{TD Error} = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$

  2. Q-value Update: The Q-value for the current state-action pair $(s, a)$ is updated using the TD error, scaled by the learning rate $\alpha$. This balances learning from new experiences against relying on existing knowledge.

  3. Learning Dynamics:

    • The update incorporates both the immediate reward $r$ and the discounted future rewards $\gamma \max_{a'} Q(s', a')$.
    • Over time, the Q-table converges to optimal values, assuming sufficient exploration and a properly tuned learning rate.
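
Putting these pieces together, a full Q-learning training loop on one of the ToyText environments looks roughly like the sketch below. The hyperparameter values and the epsilon-greedy exploration scheme are illustrative assumptions and may differ from the configurations used in this repository.

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=True)
q_table = np.zeros((env.observation_space.n, env.action_space.n))

alpha, gamma = 0.1, 0.99          # learning rate and discount factor (example values)
epsilon, episodes = 0.1, 10_000   # exploration rate and number of episodes (example values)

for _ in range(episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection: explore with probability epsilon.
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update driven by the TD error.
        td_error = reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        q_table[state, action] += alpha * td_error
        state = next_state

env.close()
```

With enough episodes and exploration, `np.argmax(q_table[state])` then defines a greedy policy that can be evaluated without exploration.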

🌐 Environments

Each environment comes with a demo and a plot of results:

  • FrozenLake-v1
  • CliffWalking-v0
  • Blackjack-v1
  • Taxi-v3
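
All four environments are created through Gymnasium's standard `gym.make` API; a quick sketch (default settings, no render mode) is shown below.

```python
import gymnasium as gym

# The four Toy Text environments covered in this project, with default settings.
for env_id in ["FrozenLake-v1", "CliffWalking-v0", "Blackjack-v1", "Taxi-v3"]:
    env = gym.make(env_id)
    print(env_id, env.observation_space, env.action_space)
    env.close()
```

Note that Blackjack-v1 returns a tuple observation (player sum, dealer card, usable ace) rather than a single integer, so its Q-table is indexed slightly differently from the grid-world environments.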

