diff --git a/docs/machine-learning/deep-q-networks.md b/docs/machine-learning/deep-q-networks.md new file mode 100644 index 00000000..7e25e08b --- /dev/null +++ b/docs/machine-learning/deep-q-networks.md @@ -0,0 +1,164 @@ +--- +id: deep-q-networks +title: Deep Q-Networks (DQN) Algorithm +sidebar_label: Deep Q-Networks +description: "An introduction to Deep Q-Networks, a reinforcement learning technique that combines Q-Learning with deep neural networks to handle complex, high-dimensional state spaces." +tags: [machine learning, reinforcement learning, DQN, deep learning, algorithms] +--- + + + +### Definition: +**Deep Q-Networks (DQN)** is a reinforcement learning algorithm that extends Q-Learning by using deep neural networks to approximate the Q-function. This allows DQN to handle high-dimensional state spaces that are not feasible with traditional tabular Q-Learning. The approach was popularized by DeepMind's success in applying DQN to play Atari games at a superhuman level. + +### Characteristics: +- **Combines Deep Learning with Reinforcement Learning**: + DQN leverages neural networks to estimate Q-values, enabling agents to make decisions in environments with complex, high-dimensional state representations. + +- **Experience Replay**: + To improve training stability, DQN stores experiences (state, action, reward, next state) in a replay buffer and samples mini-batches from this buffer to train the network. This reduces correlations between consecutive experiences. + +- **Fixed Target Network**: + DQN uses a separate target network to provide stable Q-value updates. This network is periodically updated with the weights of the main Q-network, preventing harmful feedback loops during training. + + + +### How It Works: +DQN follows the same principles as Q-Learning but uses a deep neural network $ Q(s, a; \theta) $ parameterized by weights $ \theta $ to approximate Q-values. The network is trained to minimize the loss function: + +$$ +L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \text{ReplayBuffer}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right] +$$ + +- **$ \theta $**: Weights of the current Q-network +- **$ \theta^- $**: Weights of the target network (held fixed for stability) +- **$ \gamma $**: Discount factor for future rewards + +### Steps Involved: +1. **Initialize Replay Buffer and Networks**: + Initialize the replay buffer, the Q-network with weights $ \theta $, and the target network with weights $ \theta^- $ (set $ \theta^- = \theta $). + +2. **Choose an Action**: + Select an action using an epsilon-greedy policy based on the Q-values predicted by the Q-network. + +3. **Take Action and Store Experience**: + Execute the action, observe the reward, and store the experience (state, action, reward, next state) in the replay buffer. + +4. **Sample Mini-Batch**: + Randomly sample a mini-batch of experiences from the replay buffer for training. + +5. **Compute Target and Update Weights**: + Compute the target Q-value and update the Q-network weights $ \theta $ by minimizing the loss function. + +6. **Update Target Network**: + Periodically update the target network weights $ \theta^- $ to match the Q-network weights $ \theta $. + + + +### Problem Statement: +Develop a DQN agent capable of learning policies in environments where the state space is large or continuous, such as video games or robotics. + +### Key Concepts: +- **Neural Network Architecture**: + A multi-layer neural network that takes the state as input and outputs Q-values for each action. + +- **Experience Replay Buffer**: + Stores past experiences to break the temporal correlations between consecutive training samples and improve training efficiency. + +- **Epsilon-Greedy Strategy**: + Balances exploration and exploitation by selecting random actions with probability $ \epsilon $ and greedy actions based on the Q-network with probability $ 1 - \epsilon $. + +### Example: +Consider an agent learning to play a simple game with pixel-based input: + +- **States**: Raw pixel data from the game screen. +- **Actions**: Available moves (e.g., left, right, jump). +- **Rewards**: +1 for progressing in the game, -1 for losing a life. + + + +**Training Steps**: +1. Preprocess the state (e.g., resize, grayscale). +2. Feed the state into the Q-network to predict Q-values. +3. Select an action using the epsilon-greedy strategy. +4. Observe the reward and next state. +5. Store the experience in the replay buffer. +6. Sample a mini-batch from the buffer and train the network. + +### Python Implementation: +Below is a simplified version of a DQN implementation in Python using **TensorFlow/Keras**: + +```python +import numpy as np +import tensorflow as tf +from collections import deque +import random + +# Neural network model for the Q-network +def build_model(state_size, action_size): + model = tf.keras.Sequential([ + tf.keras.layers.Dense(24, input_dim=state_size, activation='relu'), + tf.keras.layers.Dense(24, activation='relu'), + tf.keras.layers.Dense(action_size, activation='linear') + ]) + model.compile(optimizer='adam', loss='mse') + return model + +# Replay buffer +replay_buffer = deque(maxlen=2000) + +# Parameters +state_size = 4 # Example state space size +action_size = 2 # Example action space size +gamma = 0.95 # Discount factor +epsilon = 1.0 # Exploration rate +epsilon_min = 0.01 +epsilon_decay = 0.995 +batch_size = 32 +update_target_frequency = 10 + +# Initialize models +q_network = build_model(state_size, action_size) +target_network = build_model(state_size, action_size) +target_network.set_weights(q_network.get_weights()) + +# Training loop +for episode in range(num_episodes): + state = env.reset() + done = False + while not done: + # Choose action using epsilon-greedy strategy + if np.random.rand() < epsilon: + action = env.action_space.sample() + else: + action = np.argmax(q_network.predict(state)) + + # Take action, observe reward and next state + next_state, reward, done, _ = env.step(action) + replay_buffer.append((state, action, reward, next_state, done)) + state = next_state + + # Sample mini-batch and train + if len(replay_buffer) > batch_size: + minibatch = random.sample(replay_buffer, batch_size) + for state, action, reward, next_state, done in minibatch: + target = reward + if not done: + target += gamma * np.max(target_network.predict(next_state)) + target_f = q_network.predict(state) + target_f[0][action] = target + q_network.fit(state, target_f, epochs=1, verbose=0) + + # Update epsilon + if epsilon > epsilon_min: + epsilon *= epsilon_decay + + # Update target network + if episode % update_target_frequency == 0: + target_network.set_weights(q_network.get_weights()) + +print("DQN training completed.") +``` + +### Conclusion: +DQN has significantly advanced the capabilities of reinforcement learning, making it practical to apply RL algorithms in complex environments with large state spaces. By leveraging deep neural networks, DQN has paved the way for applications ranging from gaming to real-world tasks like robotics.