This repository contains a solution for the CartPole-v1 problem of the gymnasium library with Deep Reinforcement Learning.
The project focuses on two major algorithms, DQN and SARSA, and evaluates their performance in solving the CartPole-v1 problem.
The CartPole-v1 environment involves balancing a pole on a cart that moves along a frictionless track. The agent's task is to prevent the pole from falling by applying forces to the cart. Below are the key features and conditions:
| Property | Details |
|---|---|
| Goal | Keep the pole balanced for as long as possible. |
| Reward | +1 for every time step the pole is balanced. |
| Termination Conditions | Pole angle exceeds 12° or the cart position exceeds the track boundaries. |
| Maximum Episode Length | 500 time steps. |
**Observation Space** (4 continuous values):
- Cart Position (x)
- Cart Velocity (ẋ)
- Pole Angle (θ)
- Pole Angular Velocity (θ̇)

**Action Space** (2 discrete actions):
- 0: Push the cart to the left.
- 1: Push the cart to the right.
The task is episodic, and a well-trained agent aims to keep the pole balanced for the maximum reward of 500. For more information, visit the CartPole-v1 documentation.
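For reference, a minimal interaction loop with the environment (using a random policy rather than this repository's trained agents) looks like the sketch below:

```python
# Minimal gymnasium interaction loop for CartPole-v1 (random policy, for illustration only).
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)  # observation = [x, x_dot, theta, theta_dot]

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random choice: 0 = push left, 1 = push right
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # +1 for every step the pole stays up
    done = terminated or truncated      # truncated at 500 steps, terminated on failure

print(f"Episode return: {total_reward}")
env.close()
```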
- **Algorithm Implementation**:
  - Implement the DQN and SARSA algorithms and train an agent with each separately (a minimal Q-network and action-selection sketch follows after this list).
  - Compare the success or failure of each algorithm, focusing on convergence speed, reward maximization, and overall performance.
- **Evaluation Metrics**:
  - Plot graphs for (a plotting sketch follows after this list):
    - Rewards: the total reward earned by the agent per episode.
    - Loss (for DQN): the error in predicting future rewards.
    - Epsilon Decay: how the exploration-exploitation trade-off evolves during training.
- **Hyperparameter Tuning**:
  - Train the DQN model with at least three different sets of hyperparameters and report the results in a table.
  - Analyze the impact of these hyperparameters on the performance of the models.
- **Testing**:
  - Test functions are provided in the code, and all final weights are saved in the respective files for reproducibility.
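As a rough illustration of the building block both agents share, a Q-network plus epsilon-greedy action selection might look like the sketch below; the class and function names are placeholders, not the repository's actual code:

```python
# Illustrative Q-network and epsilon-greedy action selection for CartPole (names are placeholders).
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a 4-dimensional CartPole state to Q-values for the 2 actions."""
    def __init__(self, state_dim=4, n_actions=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def select_action(q_net, state, epsilon, n_actions=2):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily on Q-values."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```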
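Similarly, the three training curves could be plotted with a small matplotlib helper along these lines; the `rewards`, `losses`, and `epsilons` lists are assumed to be collected during training (names are illustrative):

```python
# Sketch of plotting per-episode rewards, DQN training loss, and the epsilon schedule.
import matplotlib.pyplot as plt

def plot_training_curves(rewards, losses, epsilons):
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    axes[0].plot(rewards)
    axes[0].set(title="Episode reward", xlabel="Episode", ylabel="Reward")
    axes[1].plot(losses)
    axes[1].set(title="Training loss (DQN)", xlabel="Update step", ylabel="Loss")
    axes[2].plot(epsilons)
    axes[2].set(title="Epsilon decay", xlabel="Episode", ylabel="Epsilon")
    fig.tight_layout()
    plt.show()
```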
Below are the hyperparameter tables for the DQN and SARSA algorithms across the three stages of experimentation:
![DQN Hyperparameters](https://private-user-images.githubusercontent.com/79360286/371162355-dd3bff16-063f-4122-b867-029d8fbd964e.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTgxMDksIm5iZiI6MTczOTQxNzgwOSwicGF0aCI6Ii83OTM2MDI4Ni8zNzExNjIzNTUtZGQzYmZmMTYtMDYzZi00MTIyLWI4NjctMDI5ZDhmYmQ5NjRlLmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAzMzY0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWVmYTdjNjk4YTM1OWY2ZDBiZDA3ZmQ1MmI0OTM3ZmIwMThlNWNmYTg4MDgzNTNiMmZjMTM3YzNkZjQ2YTAwOTMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.1vGqSkS_lHv6R3GOG_EQTjESjXXsJ6gSttu4UcuoaWI)
![SARSA Hyperparameters](https://private-user-images.githubusercontent.com/79360286/371162375-2723684d-ec79-4f15-9412-fb32de9eabad.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTgxMDksIm5iZiI6MTczOTQxNzgwOSwicGF0aCI6Ii83OTM2MDI4Ni8zNzExNjIzNzUtMjcyMzY4NGQtZWM3OS00ZjE1LTk0MTItZmIzMmRlOWVhYmFkLmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAzMzY0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWU2OTFlZWExYjk0MzUxZmQ5NTU4YTIzZWZjOGIxZjg2ZmE5YTQ2N2Q2NjMzOGViOWZmNDQxMjZhMzIxNTljYTEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.nVqcJvH1qUbQX0n6JaF87pJb8yHL9xDSMZiEQ_U9kJQ)
- **Convergence**:
  DQN consistently outperforms SARSA, converging faster and achieving higher rewards. SARSA, being an on-policy algorithm, makes less efficient use of experience than the off-policy DQN (the update-target sketch after the result plots makes the difference concrete).
  - DQN: reuses each experience multiple times via a replay buffer and bootstraps from the maximum Q-value over next actions, leading to better performance.
  - SARSA: bootstraps from the action actually taken by the current policy, which results in slower convergence and noisier performance.
- **Results Summary**:
  - DQN converges to optimal rewards, while SARSA struggles with noisy updates.
![DQN Rewards](https://private-user-images.githubusercontent.com/79360286/371164767-3faaa044-d5df-44c0-b620-e42cf1b1b69d.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTgxMDksIm5iZiI6MTczOTQxNzgwOSwicGF0aCI6Ii83OTM2MDI4Ni8zNzExNjQ3NjctM2ZhYWEwNDQtZDVkZi00NGMwLWI2MjAtZTQyY2YxYjFiNjlkLmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAzMzY0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWExMmRkOTAxNjU5MTRhODMxZGVjN2RiMGExODNkZGFkNTgyYWM3MTFmZDczYTFiY2Y0ZDk4YTUwZWY1NTg3MDYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.N0SAISXHyZRiHoJ0ReQ86803c2S2j6c-RGl45txovVo)
![DQN Loss](https://private-user-images.githubusercontent.com/79360286/371164707-6c8fecdb-fd6f-4607-af14-76cda3b96809.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTgxMDksIm5iZiI6MTczOTQxNzgwOSwicGF0aCI6Ii83OTM2MDI4Ni8zNzExNjQ3MDctNmM4ZmVjZGItZmQ2Zi00NjA3LWFmMTQtNzZjZGEzYjk2ODA5LmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAzMzY0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPThhM2M5NDM3NzJmMjQ2ZmVmYTg3NTY1YTI0NmJjMWYxMzAyNjBjN2U1MmVjY2Y2ZWQ0YWJkMjdkYzBkMTBmMTAmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.QkSFEuCmSGOkxpgq6m42Up0ScgFlf6Bcw9qUeuMKjeQ)
![DQN Epsilon](https://private-user-images.githubusercontent.com/79360286/371164697-74ffc7fe-5a55-467d-ac04-0ad9cbe48945.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTgxMDksIm5iZiI6MTczOTQxNzgwOSwicGF0aCI6Ii83OTM2MDI4Ni8zNzExNjQ2OTctNzRmZmM3ZmUtNWE1NS00NjdkLWFjMDQtMGFkOWNiZTQ4OTQ1LmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAzMzY0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTJmZTFkMTg1NzdhYmU5NWU5MzExMTkwNWY1NDJhOGMyMjRhZWQ5YzE4NWM2NjcwZTBjODhlMTRlYjViMjc1NTImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.WGAKGcrjiBCdcDY1pUi5iuyOeohd934edroV78vUft0)
![SARSA Rewards](https://private-user-images.githubusercontent.com/79360286/371165817-8e6f27aa-c492-4948-9a1b-0babc35681e9.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTgxMDksIm5iZiI6MTczOTQxNzgwOSwicGF0aCI6Ii83OTM2MDI4Ni8zNzExNjU4MTctOGU2ZjI3YWEtYzQ5Mi00OTQ4LTlhMWItMGJhYmMzNTY4MWU5LmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAzMzY0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTRhNDQwNTYyZTUxM2E1MzViMTUwOTQ2MDlhNDkwNWZlN2M4ZjU0ZWU0MDVmMWNmZGU5ZTlhMjZhMGQ2N2I4NzcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.-pwhjNuv5SnLfmP0u4lFG2R44qLwq90ICq1Fk7_a25o)
![SARSA Loss](https://private-user-images.githubusercontent.com/79360286/371165841-309c2a9a-5abd-46f0-ac34-ee4f80b617fd.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTgxMDksIm5iZiI6MTczOTQxNzgwOSwicGF0aCI6Ii83OTM2MDI4Ni8zNzExNjU4NDEtMzA5YzJhOWEtNWFiZC00NmYwLWFjMzQtZWU0ZjgwYjYxN2ZkLmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAzMzY0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTJkN2RhZDVkY2JjYWFlYTFlYWU1NDdmMGNkMzQ1YTViOTY1MmUzYmVlYzFhNDE3OWEzMWQ3YTBkY2IyMTFhYTAmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.-fYy2xHTAX6NDD-RABD_oYLemyt9XapiYAJyDov5OIA)
![SARSA Epsilon](https://private-user-images.githubusercontent.com/79360286/371165849-427f8900-c147-4e88-a287-b72dae56b8c8.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTgxMDksIm5iZiI6MTczOTQxNzgwOSwicGF0aCI6Ii83OTM2MDI4Ni8zNzExNjU4NDktNDI3Zjg5MDAtYzE0Ny00ZTg4LWEyODctYjcyZGFlNTZiOGM4LmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAzMzY0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWJmNzk1ZDljM2JhNWM5YmNiNGZmNzk5OGE0NjlkNGYxMzkyOTFlYTIxMTVjOTcyMTIxNTVhY2VjMGZlZDcwOGQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.Nz_BVthY4Al1wt2LC3arwQ9d_SMBIgTyXgC4pbk3raw)
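To make the DQN/SARSA difference concrete, here is a hedged sketch of the two bootstrap targets, assuming PyTorch tensors for a sampled batch; the tensor and network names are illustrative, not the repository's actual code:

```python
# Sketch contrasting the bootstrap targets of DQN (off-policy) and SARSA (on-policy).
# Assumed shapes: rewards, dones -> [B]; next_states -> [B, 4]; next_actions -> [B] (int64).
import torch

def dqn_target(target_net, rewards, next_states, dones, gamma=0.99):
    # Off-policy: bootstrap from the best next action, regardless of what was actually played.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * max_next_q * (1.0 - dones)

def sarsa_target(q_net, rewards, next_states, next_actions, dones, gamma=0.99):
    # On-policy: bootstrap from the action the current policy actually took in the next state.
    with torch.no_grad():
        next_q = q_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
    return rewards + gamma * next_q * (1.0 - dones)
```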
- **Boltzmann Exploration**:
  Instead of using epsilon-greedy for exploration, Boltzmann (softmax) exploration was implemented. In this strategy, a temperature parameter controls the randomness of action selection (a minimal sketch appears after the tuning table below).
- **Parameters**:
  - Temperature: controls the exploration intensity.
  - Decay Rate: determines how fast the temperature decreases.
- **Results**:
  - Boltzmann temperature control led to faster convergence than epsilon-greedy.
  - Hyperparameter tuning further accelerated convergence, with the optimal parameters leading to early stopping at around 1500 episodes.
![Boltzmann Rewards](https://private-user-images.githubusercontent.com/79360286/371168874-b635d979-dd86-4c5f-904a-866411d76d91.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTgxMDksIm5iZiI6MTczOTQxNzgwOSwicGF0aCI6Ii83OTM2MDI4Ni8zNzExNjg4NzQtYjYzNWQ5NzktZGQ4Ni00YzVmLTkwNGEtODY2NDExZDc2ZDkxLmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAzMzY0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTQ5MWZmZWY1NmQ4M2E0MzQ3NTBkNTkzNjI5MDgzMjM5OTg0MWU5NzFlMWYxNjljODc4ZjBmYzBmYjM0NmVmNzImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.1cz7hOZt9rlhlPJZJBfu1Exk_BBJdowvOZpmKEmSApE)
![Boltzmann Loss](https://private-user-images.githubusercontent.com/79360286/371168847-84308e0d-d8d9-4c79-9d7a-5155c85c8ea6.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTgxMDksIm5iZiI6MTczOTQxNzgwOSwicGF0aCI6Ii83OTM2MDI4Ni8zNzExNjg4NDctODQzMDhlMGQtZDhkOS00Yzc5LTlkN2EtNTE1NWM4NWM4ZWE2LmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAzMzY0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg5MjE4NjRkNzc4ZmQ1NTA0YWMyNTYyNTE0YzUwYWUyNmUwODc3NmUzYWQ3MjQ1NmU4NTcxOTM2NzkxNmZhM2QmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.0vCSCQ2GZupe--PMu4W1ffXKfEyN0JkCLPEJADmtG84)
![Boltzmann Temperature](https://private-user-images.githubusercontent.com/79360286/371169473-4ff84bc6-4f2d-4617-8f49-b82625a6ee83.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0MTgxMDksIm5iZiI6MTczOTQxNzgwOSwicGF0aCI6Ii83OTM2MDI4Ni8zNzExNjk0NzMtNGZmODRiYzYtNGYyZC00NjE3LThmNDktYjgyNjI1YTZlZTgzLmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDAzMzY0OVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTNlODcxNDE4ODhmY2E3Y2UxNzI3MDkxY2E0ZThjZmJiNzQwNDNjZjY1OThjYTI2ZGIzNTUzYjI0ODBhZjQzNzYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.7-ROeiDY3JaABn952zLLbJXSMMRXs5bWvuCkHl5yzO8)
| Parameter | Run 1 | Run 2 | Run 3 | Optimal |
|---|---|---|---|---|
| Learning Rate | 2.3e-2 | 2e-2 | 1e-3 | 2.3e-2 |
| Discount Factor | 0.9 | 0.98 | 0.96 | 0.93 |
| Update Frequency | 8 | 5 | 15 | 10 |
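For reference, a minimal sketch of Boltzmann (softmax) action selection with temperature decay is shown below; the network interface and the schedule values are assumptions for illustration, not the tuned settings from the table above:

```python
# Sketch of Boltzmann (softmax) exploration: actions are sampled in proportion to
# exp(Q(s, a) / temperature); high temperature -> near-uniform, low temperature -> near-greedy.
import torch
import torch.nn.functional as F

def boltzmann_action(q_net, state, temperature):
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    probs = F.softmax(q_values / temperature, dim=1)
    return int(torch.multinomial(probs, num_samples=1).item())

# Illustrative temperature schedule (values are assumptions, not the tuned ones above).
temperature, temperature_min, decay_rate = 1.0, 0.05, 0.995
for episode in range(2000):
    # ... run one episode, selecting actions with boltzmann_action(q_net, state, temperature) ...
    temperature = max(temperature_min, temperature * decay_rate)
```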
*Demo video: `Agent.s.game.play.MP4` (the trained agent playing an episode).*
This project demonstrates the advantages of DQN over SARSA in reinforcement learning tasks, particularly in environments that reward efficient exploration and experience reuse. Boltzmann exploration proved more effective than epsilon-greedy, especially when tuned correctly.
To run the code:
- Install the required libraries:
  `pip install gymnasium`
- Clone the repository:
  `git clone https://github.com/navidadkhah/CartPole-V1`
  `cd CartPole-V1`
- Run the training scripts for DQN and SARSA.
- Run the test functions using their corresponding saved weights.
This project is under the MIT License, and I’d be thrilled if you use and improve my work!