From f75c120d2d65163f39b3c2e09b010769bae89225 Mon Sep 17 00:00:00 2001
From: Priya <150767072+priyashuu@users.noreply.github.com>
Date: Sun, 10 Nov 2024 11:31:11 +0530
Subject: [PATCH 1/2] added q learning

---
 docs/machine-learning/qLearning.md | 128 +++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)
 create mode 100644 docs/machine-learning/qLearning.md

diff --git a/docs/machine-learning/qLearning.md b/docs/machine-learning/qLearning.md
new file mode 100644
index 000000000..095f156fb
--- /dev/null
+++ b/docs/machine-learning/qLearning.md
@@ -0,0 +1,128 @@
+---
+
+id: q-learning
+title: Q-Learning Algorithm
+sidebar_label: Q-Learning
+description: "An overview of the Q-Learning Algorithm, a model-free reinforcement learning method that learns the optimal action-value function to guide decision-making."
+tags: [machine learning, reinforcement learning, q-learning, algorithms, model-free]
+
+---
+
+### Definition:
+The **Q-Learning Algorithm** is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for any given finite Markov Decision Process (MDP). It works by learning the value of actions in specific states without needing a model of the environment and aims to optimize long-term rewards.
+
+### Characteristics:
+- **Model-Free**:
+  Q-Learning does not require prior knowledge of the environment's dynamics and learns directly from experience.
+
+- **Off-Policy**:
+  The algorithm updates the value of actions using the maximum expected future rewards, regardless of the agent's current policy.
+
+- **Exploration vs. Exploitation**:
+  Balances exploring new actions to find better long-term rewards and exploiting known actions to maximize immediate rewards, often managed using an **epsilon-greedy** strategy.
+
+### How It Works:
+Q-Learning learns an **action-value function (Q-function)** that maps state-action pairs to their expected cumulative rewards. The Q-value is updated using the following equation:
+
+\[
+Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
+\]
+
+- **\( s \)**: Current state
+- **\( a \)**: Action taken in the current state
+- **\( r \)**: Reward received after taking action \( a \)
+- **\( s' \)**: Next state after taking action \( a \)
+- **\( \alpha \)**: Learning rate (controls how much new information overrides old)
+- **\( \gamma \)**: Discount factor (determines the importance of future rewards)
+
+### Steps Involved:
+1. **Initialization**:
+   Initialize the Q-table with zeros or random values for all possible state-action pairs.
+
+2. **Choose an Action**:
+   Select an action using an exploration strategy (e.g., epsilon-greedy).
+
+3. **Take Action and Observe Reward**:
+   Execute the selected action, transition to a new state, and receive the corresponding reward.
+
+4. **Update Q-Value**:
+   Update the Q-value of the current state-action pair using the Q-Learning update rule.
+
+5. **Repeat**:
+   Continue until the learning converges or a stopping condition is met.
+
+### Problem Statement:
+Given an environment defined by states and actions with unknown dynamics, the goal is to learn the optimal Q-function that allows an agent to make decisions maximizing cumulative rewards over time.
+
+### Key Concepts:
+- **Q-Table**:
+  A matrix where each row represents a state, and each column represents an action. The values represent the learned Q-values for state-action pairs.
+
+- **Epsilon-Greedy Strategy**:
+  A common method to balance exploration and exploitation. The agent selects a random action with probability \( \epsilon \) and the best-known action with probability \( 1 - \epsilon \).
+
+- **Convergence**:
+  Q-Learning converges to the optimal Q-function given an infinite number of episodes and a decaying learning rate.
+
+### Example:
+Consider a grid-world environment where an agent navigates to collect rewards:
+
+- **States**: Positions on the grid (e.g., (1,1), (1,2))
+- **Actions**: Up, Down, Left, Right
+- **Rewards**: +1 for reaching the goal, -1 for hitting obstacles, 0 otherwise
+
+**Update Step**:
+After moving from (1,1) to (1,2) with action "Right" and receiving a reward of 0:
+\[
+Q(1,1, \text{Right}) \leftarrow Q(1,1, \text{Right}) + \alpha \left[ 0 + \gamma \max_{a'} Q(1,2, a') - Q(1,1, \text{Right}) \right]
+\]
+
+### Python Implementation:
+Here is a basic implementation of Q-Learning in Python:
+
+```python
+import numpy as np
+
+# Initialize Q-table with zeros
+q_table = np.zeros((state_space_size, action_space_size))
+
+# Hyperparameters
+alpha = 0.1  # Learning rate
+gamma = 0.99  # Discount factor
+epsilon = 1.0  # Exploration rate
+epsilon_decay = 0.995
+min_epsilon = 0.01
+
+# Training loop
+for episode in range(num_episodes):
+    state = env.reset()
+    done = False
+
+    while not done:
+        # Choose action using epsilon-greedy strategy
+        if np.random.rand() < epsilon:
+            action = env.action_space.sample()  # Explore
+        else:
+            action = np.argmax(q_table[state])  # Exploit
+
+        # Take action and observe result
+        next_state, reward, done, _ = env.step(action)
+
+        # Update Q-value
+        q_table[state, action] = q_table[state, action] + alpha * (
+            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
+        )
+
+        # Transition to next state
+        state = next_state
+
+    # Decay epsilon
+    epsilon = max(min_epsilon, epsilon * epsilon_decay)
+
+print("Training completed.")
+```
+
+### Conclusion:
+Q-Learning is a powerful and foundational reinforcement learning technique that enables agents to learn optimal policies through direct interaction with an environment. Its simplicity and effectiveness make it a popular choice for many RL applications.
+
+---

From 5bb505ae26d90886812fec0d079153a1910f56be Mon Sep 17 00:00:00 2001
From: Ajay Dhangar <99037494+ajay-dhangar@users.noreply.github.com>
Date: Sun, 10 Nov 2024 21:33:36 +0530
Subject: [PATCH 2/2] Update qLearning.md

---
 docs/machine-learning/qLearning.md | 45 +++++++++++++++++++++----------------------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/docs/machine-learning/qLearning.md b/docs/machine-learning/qLearning.md
index 095f156fb..1a3be3b1f 100644
--- a/docs/machine-learning/qLearning.md
+++ b/docs/machine-learning/qLearning.md
@@ -1,16 +1,16 @@
 ---
-
-id: q-learning
-title: Q-Learning Algorithm
-sidebar_label: Q-Learning
-description: "An overview of the Q-Learning Algorithm, a model-free reinforcement learning method that learns the optimal action-value function to guide decision-making."
-tags: [machine learning, reinforcement learning, q-learning, algorithms, model-free]
-
+id: q-learning
+title: Q Learning Algorithm
+sidebar_label: Q Learning
+description: "An overview of the Q-Learning Algorithm, a model-free reinforcement learning method that learns the optimal action-value function to guide decision-making."
+tags: [machine learning, reinforcement learning, q-learning, algorithms, model-free]
 ---
 
 ### Definition:
 The **Q-Learning Algorithm** is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for any given finite Markov Decision Process (MDP). It works by learning the value of actions in specific states without needing a model of the environment and aims to optimize long-term rewards.
 
+
+
 ### Characteristics:
 - **Model-Free**:
   Q-Learning does not require prior knowledge of the environment's dynamics and learns directly from experience.
@@ -24,16 +24,16 @@ The **Q-Learning Algorithm** is a model-free reinforcement learning algorithm us
 ### How It Works:
 Q-Learning learns an **action-value function (Q-function)** that maps state-action pairs to their expected cumulative rewards. The Q-value is updated using the following equation:
 
-\[
+$$
 Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
-\]
+$$
 
-- **\( s \)**: Current state
-- **\( a \)**: Action taken in the current state
-- **\( r \)**: Reward received after taking action \( a \)
-- **\( s' \)**: Next state after taking action \( a \)
-- **\( \alpha \)**: Learning rate (controls how much new information overrides old)
-- **\( \gamma \)**: Discount factor (determines the importance of future rewards)
+- $ s $: Current state
+- $ a $: Action taken in the current state
+- $ r $: Reward received after taking action $ a $
+- $ s' $: Next state after taking action $ a $
+- $ \alpha $: Learning rate (controls how much new information overrides old)
+- $ \gamma $: Discount factor (determines the importance of future rewards)
 
 ### Steps Involved:
 1. **Initialization**:
    Initialize the Q-table with zeros or random values for all possible state-action pairs.
@@ -51,6 +51,8 @@ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,
 5. **Repeat**:
    Continue until the learning converges or a stopping condition is met.
 
+
+
 ### Problem Statement:
 Given an environment defined by states and actions with unknown dynamics, the goal is to learn the optimal Q-function that allows an agent to make decisions maximizing cumulative rewards over time.
 
@@ -59,10 +61,10 @@ Given an environment defined by states and actions with unknown dynamics, the go
   A matrix where each row represents a state, and each column represents an action. The values represent the learned Q-values for state-action pairs.
 
 - **Epsilon-Greedy Strategy**:
-  A common method to balance exploration and exploitation. The agent selects a random action with probability \( \epsilon \) and the best-known action with probability \( 1 - \epsilon \).
+  A common method to balance exploration and exploitation. The agent selects a random action with probability $ \epsilon $ and the best-known action with probability $ 1 - \epsilon $.
 
 - **Convergence**:
-  Q-Learning converges to the optimal Q-function given an infinite number of episodes and a decaying learning rate.
+  Q-learning converges to the optimal Q-function given an infinite number of episodes and a decaying learning rate.
 
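To make the **epsilon-greedy** selection rule and the update equation above concrete, here is a minimal standalone sketch; the table size, the transition `(s=0, a=1, r=0.5, s'=2)`, and the hyperparameter values are made-up illustrative assumptions, not values taken from the patch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setting: 4 states, 2 actions (sizes and values chosen only for illustration)
q_table = np.zeros((4, 2))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def choose_action(state):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))  # random action (explore)
    return int(np.argmax(q_table[state]))           # best-known action (exploit)

action = choose_action(0)  # with an all-zero table the greedy tie-break picks action 0 most of the time

# One Q-learning update for a hypothetical transition (s=0, a=1, r=0.5, s'=2)
s, a, r, s_next = 0, 1, 0.5, 2
q_table[s, a] += alpha * (r + gamma * np.max(q_table[s_next]) - q_table[s, a])
print(q_table[s, a])  # 0.05 = alpha * (r + gamma * 0 - 0), since every other entry is still zero
```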
 ### Example:
 Consider a grid-world environment where an agent navigates to collect rewards:
@@ -73,9 +75,10 @@ Consider a grid-world environment where an agent navigates to collect rewards:
 
 **Update Step**:
 After moving from (1,1) to (1,2) with action "Right" and receiving a reward of 0:
-\[
+
+$$
 Q(1,1, \text{Right}) \leftarrow Q(1,1, \text{Right}) + \alpha \left[ 0 + \gamma \max_{a'} Q(1,2, a') - Q(1,1, \text{Right}) \right]
-\]
+$$
 
 ### Python Implementation:
 Here is a basic implementation of Q-Learning in Python:
 
@@ -123,6 +126,4 @@ print("Training completed.")
 ```
 
 ### Conclusion:
-Q-Learning is a powerful and foundational reinforcement learning technique that enables agents to learn optimal policies through direct interaction with an environment. Its simplicity and effectiveness make it a popular choice for many RL applications.
-
----
+Q-learning is a powerful and foundational reinforcement learning technique that enables agents to learn optimal policies through direct interaction with an environment. Its simplicity and effectiveness make it a popular choice for many RL applications.
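The snippet in the patch is a skeleton: `state_space_size`, `action_space_size`, `num_episodes`, and `env` are left undefined. Below is a self-contained sketch showing one way those names could be bound, using a tiny corridor environment invented here purely for illustration; the environment class, its dynamics, and every hyperparameter value are assumptions rather than anything specified by the patches.

```python
import numpy as np

class SimpleCorridor:
    """Toy environment: states 0..n-1 in a row; reaching the right end gives reward +1."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.n_actions = 2  # 0 = move left, 1 = move right
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        step_dir = -1 if action == 0 else 1
        self.state = min(max(self.state + step_dir, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

env = SimpleCorridor()
state_space_size, action_space_size = env.n_states, env.n_actions
q_table = np.zeros((state_space_size, action_space_size))

alpha, gamma = 0.1, 0.99  # learning rate and discount factor
epsilon, epsilon_decay, min_epsilon = 1.0, 0.95, 0.05
num_episodes, max_steps = 200, 100
rng = np.random.default_rng(42)

for episode in range(num_episodes):
    state = env.reset()
    for _ in range(max_steps):  # step cap so a single episode cannot run forever
        # Epsilon-greedy action selection
        if rng.random() < epsilon:
            action = int(rng.integers(action_space_size))
        else:
            action = int(np.argmax(q_table[state]))

        next_state, reward, done = env.step(action)

        # Q-learning update, same rule as in the document
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state
        if done:
            break

    # Decay epsilon between episodes
    epsilon = max(min_epsilon, epsilon * epsilon_decay)

# The greedy policy should now be "move right" in every non-terminal state,
# so this typically prints [1 1 1 1 0] (the terminal row is never updated).
print(np.argmax(q_table, axis=1))
```

A deterministic corridor keeps the sketch short while still exercising exploration, bootstrapping from the next state's maximum Q-value, and epsilon decay; swapping in a real environment (for example, one with a Gym-style interface) mainly changes how `env`, `reset`, and `step` are defined.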