From f75c120d2d65163f39b3c2e09b010769bae89225 Mon Sep 17 00:00:00 2001
From: Priya <150767072+priyashuu@users.noreply.github.com>
Date: Sun, 10 Nov 2024 11:31:11 +0530
Subject: [PATCH 1/2] added q learning

---
 docs/machine-learning/qLearning.md | 128 +++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)
 create mode 100644 docs/machine-learning/qLearning.md

diff --git a/docs/machine-learning/qLearning.md b/docs/machine-learning/qLearning.md
new file mode 100644
index 000000000..095f156fb
--- /dev/null
+++ b/docs/machine-learning/qLearning.md
@@ -0,0 +1,128 @@
+---
+
+id: q-learning
+title: Q-Learning Algorithm
+sidebar_label: Q-Learning
+description: "An overview of the Q-Learning Algorithm, a model-free reinforcement learning method that learns the optimal action-value function to guide decision-making."
+tags: [machine learning, reinforcement learning, q-learning, algorithms, model-free]
+
+---
+
+### Definition:
+The **Q-Learning Algorithm** is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for any given finite Markov Decision Process (MDP). It works by learning the value of actions in specific states without needing a model of the environment and aims to optimize long-term rewards.
+
+### Characteristics:
+- **Model-Free**:
+  Q-Learning does not require prior knowledge of the environment's dynamics and learns directly from experience.
+
+- **Off-Policy**:
+  The algorithm updates the value of actions using the maximum expected future rewards, regardless of the agent's current policy.
+
+- **Exploration vs. Exploitation**:
+  Balances exploring new actions to find better long-term rewards and exploiting known actions to maximize immediate rewards, often managed using an **epsilon-greedy** strategy.
+
+### How It Works:
+Q-Learning learns an **action-value function (Q-function)** that maps state-action pairs to their expected cumulative rewards. The Q-value is updated using the following equation:
+
+\[
+Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
+\]
+
+- **\( s \)**: Current state
+- **\( a \)**: Action taken in the current state
+- **\( r \)**: Reward received after taking action \( a \)
+- **\( s' \)**: Next state after taking action \( a \)
+- **\( \alpha \)**: Learning rate (controls how much new information overrides old)
+- **\( \gamma \)**: Discount factor (determines the importance of future rewards)
+
+### Steps Involved:
+1. **Initialization**:
+   Initialize the Q-table with zeros or random values for all possible state-action pairs.
+
+2. **Choose an Action**:
+   Select an action using an exploration strategy (e.g., epsilon-greedy).
+
+3. **Take Action and Observe Reward**:
+   Execute the selected action, transition to a new state, and receive the corresponding reward.
+
+4. **Update Q-Value**:
+   Update the Q-value of the current state-action pair using the Q-Learning update rule.
+
+5. **Repeat**:
+   Continue until the learning converges or a stopping condition is met.
+
+### Problem Statement:
+Given an environment defined by states and actions with unknown dynamics, the goal is to learn the optimal Q-function that allows an agent to make decisions maximizing cumulative rewards over time.
+
+### Key Concepts:
+- **Q-Table**:
+  A matrix where each row represents a state, and each column represents an action. The values represent the learned Q-values for state-action pairs.
+
+- **Epsilon-Greedy Strategy**:
+  A common method to balance exploration and exploitation. The agent selects a random action with probability \( \epsilon \) and the best-known action with probability \( 1 - \epsilon \).
+
+- **Convergence**:
+  Q-Learning converges to the optimal Q-function given an infinite number of episodes and a decaying learning rate.
+
+### Example:
+Consider a grid-world environment where an agent navigates to collect rewards:
+
+- **States**: Positions on the grid (e.g., (1,1), (1,2))
+- **Actions**: Up, Down, Left, Right
+- **Rewards**: +1 for reaching the goal, -1 for hitting obstacles, 0 otherwise
+
+**Update Step**:
+After moving from (1,1) to (1,2) with action "Right" and receiving a reward of 0:
+\[
+Q(1,1, \text{Right}) \leftarrow Q(1,1, \text{Right}) + \alpha \left[ 0 + \gamma \max_{a'} Q(1,2, a') - Q(1,1, \text{Right}) \right]
+\]
+
+### Python Implementation:
+Here is a basic implementation of Q-Learning in Python:
+
+```python
+import numpy as np
+
+# Initialize Q-table with zeros
+q_table = np.zeros((state_space_size, action_space_size))
+
+# Hyperparameters
+alpha = 0.1  # Learning rate
+gamma = 0.99  # Discount factor
+epsilon = 1.0  # Exploration rate
+epsilon_decay = 0.995
+min_epsilon = 0.01
+
+# Training loop
+for episode in range(num_episodes):
+    state = env.reset()
+    done = False
+
+    while not done:
+        # Choose action using epsilon-greedy strategy
+        if np.random.rand() < epsilon:
+            action = env.action_space.sample()  # Explore
+        else:
+            action = np.argmax(q_table[state])  # Exploit
+
+        # Take action and observe result
+        next_state, reward, done, _ = env.step(action)
+
+        # Update Q-value
+        q_table[state, action] = q_table[state, action] + alpha * (
+            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
+        )
+
+        # Transition to next state
+        state = next_state
+
+    # Decay epsilon
+    epsilon = max(min_epsilon, epsilon * epsilon_decay)
+
+print("Training completed.")
+```
+
+### Conclusion:
+Q-Learning is a powerful and foundational reinforcement learning technique that enables agents to learn optimal policies through direct interaction with an environment. Its simplicity and effectiveness make it a popular choice for many RL applications.
+
+---

From 5bb505ae26d90886812fec0d079153a1910f56be Mon Sep 17 00:00:00 2001
From: Ajay Dhangar <99037494+ajay-dhangar@users.noreply.github.com>
Date: Sun, 10 Nov 2024 21:33:36 +0530
Subject: [PATCH 2/2] Update qLearning.md

---
 docs/machine-learning/qLearning.md | 45 +++++++++++++++++++++----------------------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/docs/machine-learning/qLearning.md b/docs/machine-learning/qLearning.md
index 095f156fb..1a3be3b1f 100644
--- a/docs/machine-learning/qLearning.md
+++ b/docs/machine-learning/qLearning.md
@@ -1,16 +1,16 @@
 ---
-
-id: q-learning
-title: Q-Learning Algorithm
-sidebar_label: Q-Learning
-description: "An overview of the Q-Learning Algorithm, a model-free reinforcement learning method that learns the optimal action-value function to guide decision-making."
-tags: [machine learning, reinforcement learning, q-learning, algorithms, model-free]
-
+id: q-learning
+title: Q Learning Algorithm
+sidebar_label: Q Learning
+description: "An overview of the Q-Learning Algorithm, a model-free reinforcement learning method that learns the optimal action-value function to guide decision-making."
+tags: [machine learning, reinforcement learning, q-learning, algorithms, model-free]
 ---
 
 ### Definition:
 The **Q-Learning Algorithm** is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for any given finite Markov Decision Process (MDP). It works by learning the value of actions in specific states without needing a model of the environment and aims to optimize long-term rewards.
 
+
+
 ### Characteristics:
 - **Model-Free**:
   Q-Learning does not require prior knowledge of the environment's dynamics and learns directly from experience.
@@ -24,16 +24,16 @@ The **Q-Learning Algorithm** is a model-free reinforcement learning algorithm us
 ### How It Works:
 Q-Learning learns an **action-value function (Q-function)** that maps state-action pairs to their expected cumulative rewards. The Q-value is updated using the following equation:
 
-\[
+$$
 Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
-\]
+$$
 
-- **\( s \)**: Current state
-- **\( a \)**: Action taken in the current state
-- **\( r \)**: Reward received after taking action \( a \)
-- **\( s' \)**: Next state after taking action \( a \)
-- **\( \alpha \)**: Learning rate (controls how much new information overrides old)
-- **\( \gamma \)**: Discount factor (determines the importance of future rewards)
+- $ s $: Current state
+- $ a $: Action taken in the current state
+- $ r $: Reward received after taking action $ a $
+- $ s' $: Next state after taking action $ a $
+- $ \alpha $: Learning rate (controls how much new information overrides old)
+- $ \gamma $: Discount factor (determines the importance of future rewards)
 
 ### Steps Involved:
 1. **Initialization**:
    Initialize the Q-table with zeros or random values for all possible state-action pairs.
@@ -51,6 +51,8 @@ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,
 5. **Repeat**:
    Continue until the learning converges or a stopping condition is met.
 
+
+
 ### Problem Statement:
 Given an environment defined by states and actions with unknown dynamics, the goal is to learn the optimal Q-function that allows an agent to make decisions maximizing cumulative rewards over time.
 
@@ -59,10 +61,10 @@ Given an environment defined by states and actions with unknown dynamics, the go
   A matrix where each row represents a state, and each column represents an action. The values represent the learned Q-values for state-action pairs.
 
 - **Epsilon-Greedy Strategy**:
-  A common method to balance exploration and exploitation. The agent selects a random action with probability \( \epsilon \) and the best-known action with probability \( 1 - \epsilon \).
+  A common method to balance exploration and exploitation. The agent selects a random action with probability $ \epsilon $ and the best-known action with probability $ 1 - \epsilon $.
 
 - **Convergence**:
-  Q-Learning converges to the optimal Q-function given an infinite number of episodes and a decaying learning rate.
+  Q-learning converges to the optimal Q-function given an infinite number of episodes and a decaying learning rate.
 
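To make the **epsilon-greedy** selection rule and the update equation above concrete, here is a minimal standalone sketch; the table size, the transition `(s=0, a=1, r=0.5, s'=2)`, and the hyperparameter values are made-up illustrative assumptions, not values taken from the patch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setting: 4 states, 2 actions (sizes and values chosen only for illustration)
q_table = np.zeros((4, 2))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def choose_action(state):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))  # random action (explore)
    return int(np.argmax(q_table[state]))           # best-known action (exploit)

action = choose_action(0)  # with an all-zero table the greedy tie-break picks action 0 most of the time

# One Q-learning update for a hypothetical transition (s=0, a=1, r=0.5, s'=2)
s, a, r, s_next = 0, 1, 0.5, 2
q_table[s, a] += alpha * (r + gamma * np.max(q_table[s_next]) - q_table[s, a])
print(q_table[s, a])  # 0.05 = alpha * (r + gamma * 0 - 0), since every other entry is still zero
```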
 ### Example:
 Consider a grid-world environment where an agent navigates to collect rewards:
@@ -73,9 +75,10 @@ Consider a grid-world environment where an agent navigates to collect rewards:
 
 **Update Step**:
 After moving from (1,1) to (1,2) with action "Right" and receiving a reward of 0:
-\[
+
+$$
 Q(1,1, \text{Right}) \leftarrow Q(1,1, \text{Right}) + \alpha \left[ 0 + \gamma \max_{a'} Q(1,2, a') - Q(1,1, \text{Right}) \right]
-\]
+$$
 
 ### Python Implementation:
 Here is a basic implementation of Q-Learning in Python:
 
@@ -123,6 +126,4 @@ print("Training completed.")
 ```
 
 ### Conclusion:
-Q-Learning is a powerful and foundational reinforcement learning technique that enables agents to learn optimal policies through direct interaction with an environment. Its simplicity and effectiveness make it a popular choice for many RL applications.
-
----
+Q-learning is a powerful and foundational reinforcement learning technique that enables agents to learn optimal policies through direct interaction with an environment. Its simplicity and effectiveness make it a popular choice for many RL applications.
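The snippet in the patch is a skeleton: `state_space_size`, `action_space_size`, `num_episodes`, and `env` are left undefined. Below is a self-contained sketch showing one way those names could be bound, using a tiny corridor environment invented here purely for illustration; the environment class, its dynamics, and every hyperparameter value are assumptions rather than anything specified by the patches.

```python
import numpy as np

class SimpleCorridor:
    """Toy environment: states 0..n-1 in a row; reaching the right end gives reward +1."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.n_actions = 2  # 0 = move left, 1 = move right
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        step_dir = -1 if action == 0 else 1
        self.state = min(max(self.state + step_dir, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

env = SimpleCorridor()
state_space_size, action_space_size = env.n_states, env.n_actions
q_table = np.zeros((state_space_size, action_space_size))

alpha, gamma = 0.1, 0.99  # learning rate and discount factor
epsilon, epsilon_decay, min_epsilon = 1.0, 0.95, 0.05
num_episodes, max_steps = 200, 100
rng = np.random.default_rng(42)

for episode in range(num_episodes):
    state = env.reset()
    for _ in range(max_steps):  # step cap so a single episode cannot run forever
        # Epsilon-greedy action selection
        if rng.random() < epsilon:
            action = int(rng.integers(action_space_size))
        else:
            action = int(np.argmax(q_table[state]))

        next_state, reward, done = env.step(action)

        # Q-learning update, same rule as in the document
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state
        if done:
            break

    # Decay epsilon between episodes
    epsilon = max(min_epsilon, epsilon * epsilon_decay)

# The greedy policy should now be "move right" in every non-terminal state,
# so this typically prints [1 1 1 1 0] (the terminal row is never updated).
print(np.argmax(q_table, axis=1))
```

A deterministic corridor keeps the sketch short while still exercising exploration, bootstrapping from the next state's maximum Q-value, and epsilon decay; swapping in a real environment (for example, one with a Gym-style interface) mainly changes how `env`, `reset`, and `step` are defined.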