Merge pull request #2010 from priyashuu/q
added q learning
ajay-dhangar authored Nov 10, 2024
2 parents 0d1cb59 + 5bb505a commit 52de0ac
Showing 1 changed file with 129 additions and 0 deletions: docs/machine-learning/qLearning.md
@@ -0,0 +1,129 @@
---
id: q-learning
title: Q Learning Algorithm
sidebar_label: Q Learning
description: "An overview of the Q-Learning Algorithm, a model-free reinforcement learning method that learns the optimal action-value function to guide decision-making."
tags: [machine learning, reinforcement learning, q-learning, algorithms, model-free]
---

### Definition:
The **Q-Learning Algorithm** is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for any given finite Markov Decision Process (MDP). It works by learning the value of actions in specific states without needing a model of the environment and aims to optimize long-term rewards.

<Ads />

### Characteristics:
- **Model-Free**:
Q-Learning does not require prior knowledge of the environment's dynamics and learns directly from experience.

- **Off-Policy**:
The algorithm updates the value of actions using the maximum expected future rewards, regardless of the agent's current policy.

- **Exploration vs. Exploitation**:
Balances exploring new actions to discover better long-term rewards with exploiting known actions to maximize immediate reward, typically managed with an **epsilon-greedy** strategy.

### How It Works:
Q-Learning learns an **action-value function (Q-function)** that maps state-action pairs to their expected cumulative rewards. The Q-value is updated using the following equation:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

- $s$: Current state
- $a$: Action taken in the current state
- $r$: Reward received after taking action $a$
- $s'$: Next state after taking action $a$
- $\alpha$: Learning rate (controls how much new information overrides old)
- $\gamma$: Discount factor (determines the importance of future rewards)
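
As a minimal sketch, assuming a tabular Q-function stored as a 2D NumPy array indexed by state and action (the helper name `q_update` is illustrative, not from any library), the update above translates directly into code:

```python
import numpy as np

def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one Q-Learning update to a tabular Q-function (2D array: states x actions)."""
    q[s, a] += alpha * (r + gamma * q[s_next].max() - q[s, a])
```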

### Steps Involved:
1. **Initialization**:
Initialize the Q-table with zeros or random values for all possible state-action pairs.

2. **Choose an Action**:
Select an action using an exploration strategy (e.g., epsilon-greedy).

3. **Take Action and Observe Reward**:
Execute the selected action, transition to a new state, and receive the corresponding reward.

4. **Update Q-Value**:
Update the Q-value of the current state-action pair using the Q-Learning update rule.

5. **Repeat**:
Continue until the learning converges or a stopping condition is met.

<Ads />

### Problem Statement:
Given an environment defined by states and actions with unknown dynamics, the goal is to learn the optimal Q-function that allows an agent to make decisions maximizing cumulative rewards over time.

### Key Concepts:
- **Q-Table**:
A matrix where each row represents a state, and each column represents an action. The values represent the learned Q-values for state-action pairs.

- **Epsilon-Greedy Strategy**:
A common method to balance exploration and exploitation. The agent selects a random action with probability $\epsilon$ and the best-known action with probability $1 - \epsilon$ (see the sketch after this list).

- **Convergence**:
Q-Learning converges to the optimal Q-function provided every state-action pair is visited infinitely often and the learning rate decays appropriately over time.
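
A minimal sketch of epsilon-greedy action selection, assuming the Q-values for the current state are available as a 1-D NumPy array (the helper name `epsilon_greedy` and the sample values are illustrative):

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Return a random action index with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # explore: uniform random action
    return int(np.argmax(q_row))              # exploit: best-known action

# Example usage with hypothetical Q-values for one state
rng = np.random.default_rng(0)
print(epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng))
```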

### Example:
Consider a grid-world environment where an agent navigates to collect rewards:

- **States**: Positions on the grid (e.g., (1,1), (1,2))
- **Actions**: Up, Down, Left, Right
- **Rewards**: +1 for reaching the goal, -1 for hitting obstacles, 0 otherwise

**Update Step**:
After moving from (1,1) to (1,2) with action "Right" and receiving a reward of 0:

$$
Q(1,1, \text{Right}) \leftarrow Q(1,1, \text{Right}) + \alpha \left[ 0 + \gamma \max_{a'} Q(1,2, a') - Q(1,1, \text{Right}) \right]
$$
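
For concreteness, with illustrative (assumed) values $Q(1,1,\text{Right}) = 0.5$, $\max_{a'} Q(1,2, a') = 0.8$, $\alpha = 0.1$, and $\gamma = 0.9$, the update evaluates to:

$$
Q(1,1, \text{Right}) \leftarrow 0.5 + 0.1 \left[ 0 + 0.9 \times 0.8 - 0.5 \right] = 0.5 + 0.1 \times 0.22 = 0.522
$$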

### Python Implementation:
Here is a basic implementation of Q-Learning in Python. This sketch assumes a small discrete environment exposed through the classic Gym API (`FrozenLake-v1` is used purely for illustration); adapt the `reset()`/`step()` unpacking if you use Gymnasium:

```python
import numpy as np
import gym  # assumes the classic Gym reset()/step() API; adapt unpacking for Gymnasium

# Example environment (FrozenLake is used purely as an illustrative discrete MDP)
env = gym.make("FrozenLake-v1")
state_space_size = env.observation_space.n
action_space_size = env.action_space.n

# Initialize Q-table with zeros
q_table = np.zeros((state_space_size, action_space_size))

# Hyperparameters
alpha = 0.1            # Learning rate
gamma = 0.99           # Discount factor
epsilon = 1.0          # Exploration rate
epsilon_decay = 0.995  # Multiplicative decay applied after each episode
min_epsilon = 0.01     # Floor so the agent never stops exploring entirely
num_episodes = 10000   # Number of training episodes

# Training loop
for episode in range(num_episodes):
    state = env.reset()
    done = False

    while not done:
        # Choose action using epsilon-greedy strategy
        if np.random.rand() < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(q_table[state])  # Exploit

        # Take action and observe result
        next_state, reward, done, _ = env.step(action)

        # Update Q-value using the Q-Learning update rule
        q_table[state, action] = q_table[state, action] + alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )

        # Transition to next state
        state = next_state

    # Decay epsilon
    epsilon = max(min_epsilon, epsilon * epsilon_decay)

print("Training completed.")
```
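
Once training finishes, the learned greedy policy can be read directly off the Q-table; a minimal follow-up sketch:

```python
# Derive the greedy policy: the best-known action for each state
policy = np.argmax(q_table, axis=1)
print("Greedy action per state:", policy)
```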

### Conclusion:
Q-learning is a powerful and foundational reinforcement learning technique that enables agents to learn optimal policies through direct interaction with an environment. Its simplicity and effectiveness make it a popular choice for many RL applications.
