Merge pull request #2010 from priyashuu/q
added q learning
Showing 1 changed file with 129 additions and 0 deletions.
@@ -0,0 +1,129 @@

---
id: q-learning
title: Q-Learning Algorithm
sidebar_label: Q-Learning
description: "An overview of the Q-Learning Algorithm, a model-free reinforcement learning method that learns the optimal action-value function to guide decision-making."
tags: [machine learning, reinforcement learning, q-learning, algorithms, model-free]
---

### Definition:
The **Q-Learning Algorithm** is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for any finite Markov Decision Process (MDP). It learns the value of actions in specific states directly from experience, without a model of the environment, and aims to maximize long-term cumulative reward.

<Ads />

### Characteristics:
- **Model-Free**:
  Q-Learning does not require prior knowledge of the environment's dynamics and learns directly from experience.

- **Off-Policy**:
  The algorithm updates Q-values using the maximum expected future reward in the next state, regardless of the action the agent's current policy would actually choose.

- **Exploration vs. Exploitation**:
  Balances exploring new actions to discover better long-term rewards with exploiting known actions to maximize immediate rewards, often managed with an **epsilon-greedy** strategy (see the sketch after this list).

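As a concrete illustration of the exploration vs. exploitation trade-off, here is a minimal epsilon-greedy action-selection sketch. It is only a sketch under assumed names: the `q_table` array, `n_actions`, and the example state are placeholders, not part of the original snippet.

```python
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon, n_actions):
    """Return a random action with probability epsilon, otherwise the greedy action."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(q_table[state]))     # exploit

# Example: 5 states, 4 actions, fairly exploratory policy (epsilon = 0.3)
q_table = np.zeros((5, 4))
action = epsilon_greedy_action(q_table, state=2, epsilon=0.3, n_actions=4)
```
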
### How It Works:
Q-Learning learns an **action-value function (Q-function)** that maps state-action pairs to their expected cumulative rewards. The Q-value is updated using the following equation:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

- $ s $: Current state
- $ a $: Action taken in the current state
- $ r $: Reward received after taking action $ a $
- $ s' $: Next state after taking action $ a $
- $ \alpha $: Learning rate (controls how much new information overrides old)
- $ \gamma $: Discount factor (determines the importance of future rewards)

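To make the update rule concrete, here is a minimal sketch that applies it once to a toy 3-state, 2-action Q-table. The reward, learning rate, discount factor, and table contents are illustrative assumptions, not values taken from the text.

```python
import numpy as np

alpha, gamma = 0.1, 0.9            # assumed learning rate and discount factor
q = np.zeros((3, 2))               # toy Q-table: 3 states x 2 actions
q[1] = [0.2, 0.5]                  # pretend state 1 already has some learned values

s, a, r, s_next = 0, 1, 1.0, 1     # one observed transition with reward 1.0

# Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
td_target = r + gamma * np.max(q[s_next])   # 1.0 + 0.9 * 0.5 = 1.45
q[s, a] += alpha * (td_target - q[s, a])    # 0.0 + 0.1 * 1.45 = 0.145
print(q[s, a])                              # ~0.145 (up to floating-point rounding)
```
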
### Steps Involved:
1. **Initialization**:
   Initialize the Q-table with zeros or random values for all possible state-action pairs.

2. **Choose an Action**:
   Select an action using an exploration strategy (e.g., epsilon-greedy).

3. **Take Action and Observe Reward**:
   Execute the selected action, transition to the new state, and receive the corresponding reward.

4. **Update Q-Value**:
   Update the Q-value of the current state-action pair using the Q-Learning update rule.

5. **Repeat**:
   Continue until learning converges or a stopping condition is met.

<Ads />

### Problem Statement:
Given an environment defined by states and actions with unknown dynamics, the goal is to learn the optimal Q-function that allows an agent to make decisions maximizing cumulative rewards over time.

### Key Concepts:
- **Q-Table**:
  A matrix in which each row represents a state and each column represents an action. The entries are the learned Q-values for the corresponding state-action pairs (see the read-out sketch after this list).

- **Epsilon-Greedy Strategy**:
  A common method for balancing exploration and exploitation. The agent selects a random action with probability $ \epsilon $ and the best-known action with probability $ 1 - \epsilon $.

- **Convergence**:
  Q-Learning converges to the optimal Q-function provided every state-action pair continues to be visited and the learning rate decays appropriately over time.

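As a small illustration of how a learned Q-table is used, the sketch below reads off the greedy policy and the state values. The table entries are made-up placeholder numbers, not learned values from the text.

```python
import numpy as np

# Assumed toy Q-table: 3 states (rows) x 4 actions (columns: Up, Down, Left, Right)
q_table = np.array([
    [0.1, 0.0, 0.2, 0.6],
    [0.0, 0.3, 0.1, 0.9],
    [0.5, 0.2, 0.0, 0.1],
])

greedy_policy = np.argmax(q_table, axis=1)  # best action index per state -> [3, 3, 0]
state_values = np.max(q_table, axis=1)      # V(s) = max_a Q(s, a)        -> [0.6, 0.9, 0.5]
print(greedy_policy, state_values)
```
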
### Example:
Consider a grid-world environment where an agent navigates to collect rewards:

- **States**: Positions on the grid (e.g., (1,1), (1,2))
- **Actions**: Up, Down, Left, Right
- **Rewards**: +1 for reaching the goal, -1 for hitting obstacles, 0 otherwise

**Update Step**:
After moving from (1,1) to (1,2) with action "Right" and receiving a reward of 0:

$$
Q((1,1), \text{Right}) \leftarrow Q((1,1), \text{Right}) + \alpha \left[ 0 + \gamma \max_{a'} Q((1,2), a') - Q((1,1), \text{Right}) \right]
$$

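Plugging in illustrative numbers (assumed for this sketch, not given in the text), say $ \alpha = 0.1 $, $ \gamma = 0.9 $, a current estimate $ Q((1,1), \text{Right}) = 0 $, and $ \max_{a'} Q((1,2), a') = 0.5 $, the update yields:

$$
Q((1,1), \text{Right}) \leftarrow 0 + 0.1 \left[ 0 + 0.9 \times 0.5 - 0 \right] = 0.045
$$
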
### Python Implementation:
Here is a basic implementation of Q-Learning in Python. The snippet assumes that `env` (a gym-style discrete environment), `state_space_size`, `action_space_size`, and `num_episodes` are already defined; a minimal stand-in environment is sketched after the block.

```python
import numpy as np

# Initialize Q-table with zeros
# (assumes `state_space_size` and `action_space_size` are defined; see the sketch below)
q_table = np.zeros((state_space_size, action_space_size))

# Hyperparameters
alpha = 0.1           # Learning rate
gamma = 0.99          # Discount factor
epsilon = 1.0         # Exploration rate
epsilon_decay = 0.995
min_epsilon = 0.01

# Training loop (assumes `env` and `num_episodes` are defined; see the sketch below)
for episode in range(num_episodes):
    state = env.reset()
    done = False

    while not done:
        # Choose action using epsilon-greedy strategy
        if np.random.rand() < epsilon:
            action = env.action_space.sample()   # Explore
        else:
            action = np.argmax(q_table[state])   # Exploit

        # Take action and observe result
        next_state, reward, done, _ = env.step(action)

        # Update Q-value
        q_table[state, action] = q_table[state, action] + alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )

        # Transition to next state
        state = next_state

    # Decay epsilon after each episode
    epsilon = max(min_epsilon, epsilon * epsilon_decay)

print("Training completed.")
```

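The loop above references `env`, `state_space_size`, `action_space_size`, and `num_episodes` without defining them. One possible way to run it end-to-end is the minimal stand-in below: a tiny 1-D corridor environment exposing the same `reset()`/`step()` interface. All names and numbers here are illustrative assumptions, not part of the original snippet.

```python
import numpy as np

class CorridorEnv:
    """Toy 1-D corridor: start at cell 0, reach the last cell for a +1 reward."""

    class _ActionSpace:
        def __init__(self, n):
            self.n = n

        def sample(self):
            return np.random.randint(self.n)

    def __init__(self, length=5):
        self.length = length
        self.action_space = self._ActionSpace(2)  # 0 = left, 1 = right
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = int(np.clip(self.state + move, 0, self.length - 1))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}

env = CorridorEnv()
state_space_size = env.length            # 5 states
action_space_size = env.action_space.n   # 2 actions
num_episodes = 500
```

With these definitions placed before the training loop, the snippet runs end-to-end, and the learned greedy policy should come to favor moving right in every cell.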

### Conclusion:
Q-Learning is a powerful and foundational reinforcement learning technique that enables agents to learn optimal policies through direct interaction with an environment. Its simplicity and effectiveness make it a popular choice for many RL applications.