This section first explains the implementation and execution of the algorithm and then discusses the results. It also presents a variant of the algorithm: the classical algorithm in which the root of the tree and the node from which the roll-out starts change during the computation. Finally, it compares the two algorithms. This report follows the structure of the attached Python notebook (reading the code is not necessary to understand the report).
The implementation was written from scratch. Each node of the tree is represented by a Python object that contains the information and methods needed to make the node fully independent from the tree itself (useful during debugging). Each node links to its parent and to its left and right children, so the binary tree can be built recursively. Only the leaf nodes carry a value, which is drawn from a uniform distribution between
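A minimal sketch of how such a node object and the recursive construction could look. The names (`Node`, `build_tree`) and the uniform bounds are illustrative assumptions, not the identifiers actually used in the notebook:

```python
import random

class Node:
    """Sketch of a tree node; every node knows its parent and children,
    so the structure is usable on its own during debugging."""
    def __init__(self, parent=None):
        self.parent = parent
        self.left = None
        self.right = None
        self.value = None        # only leaves receive a value
        self.visits = 0          # MCTS statistics stored on the node itself
        self.total_reward = 0.0

    def is_leaf(self):
        return self.left is None and self.right is None


def build_tree(depth, parent=None, low=0.0, high=1.0):
    """Recursively build a full binary tree of the given depth; leaf values
    are drawn from a uniform distribution (low/high are placeholder bounds)."""
    node = Node(parent)
    if depth == 0:
        node.value = random.uniform(low, high)
    else:
        node.left = build_tree(depth - 1, node, low, high)
        node.right = build_tree(depth - 1, node, low, high)
    return node
```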
The experiments were performed on a binary tree of depth
See the report
In this variant, the MCTS iterations starting from a given root node are limited to 10. After those 10 iterations, the root changes and the best child of the former root becomes the new root. A similar restriction applies to the number of roll-outs starting from a particular snow-cap node: once the roll-out has started from the same node twice in a row, it switches to the most promising node in the snow-cap. When this change occurs, the old ``roll-out node'' can no longer be used as such, even if the new ``roll-out node'' collects worse results.
Using the same
As in the first section, the code was written from scratch. As with the binary tree, I chose an object-oriented approach. Each gridworld cell is independent and contains all the information that concerns it (type, coordinates on the grid, immediate reward, state-value and action-value functions, and so forth). A second object manages the logic of the world: cell values can be modified through it, it returns the next state according to the agent's action, and it can print the cells (for debugging) and plot the gridworld heat-map.
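A minimal sketch of this two-object design, assuming illustrative names (`Cell`, `GridWorld`) and a simple stay-in-place rule for walls and borders, which is an assumption about the notebook's transition logic:

```python
class Cell:
    """Sketch of a gridworld cell; field names are illustrative."""
    def __init__(self, row, col, kind="empty", reward=0.0):
        self.row, self.col = row, col
        self.kind = kind              # e.g. "empty", "wall", "pit", "goal"
        self.reward = reward          # immediate reward for entering the cell
        self.value = 0.0              # state-value estimate V(s)
        self.q = {}                   # action-value estimates Q(s, a)


class GridWorld:
    """Sketch of the object that owns the cells and the transition logic."""
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, n_rows, n_cols):
        self.cells = [[Cell(r, c) for c in range(n_cols)] for r in range(n_rows)]

    def step(self, cell, action):
        """Return the cell reached by taking `action`; moving into a wall or
        off the grid leaves the agent in place (assumed rule)."""
        dr, dc = self.ACTIONS[action]
        r, c = cell.row + dr, cell.col + dc
        if 0 <= r < len(self.cells) and 0 <= c < len(self.cells[0]) \
                and self.cells[r][c].kind != "wall":
            return self.cells[r][c]
        return cell
```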
This section explains the implementation of the Monte-Carlo algorithm for policy evaluation and then discusses the results. I decided to implement both the first-visit and the every-visit algorithms and compare their results.
An object encapsulates the logic of the Monte-Carlo algorithm. It takes as input the policy with which the agent makes decisions, which allowed me to experiment with different policies during development. The number of episodes per simulation is also configurable. The results, shown as heat-maps, represent the state-value function of the environment after all the episodes have been taken into account.
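A compact sketch of the first-visit versus every-visit update, assuming for brevity that the episodes have already been generated by the policy (each episode is a list of hypothetical (state, reward) pairs; `mc_evaluate` is not the notebook's actual API):

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=1.0, first_visit=True):
    """Tabular Monte-Carlo policy evaluation from already-collected episodes."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:
        # compute the return G_t for every time step, walking backwards
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        seen = set()
        for state, G in returns:
            if first_visit and state in seen:
                continue          # first-visit: count only the first occurrence
            seen.add(state)
            returns_sum[state] += G
            returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# toy usage with made-up states "A" and "B"
episodes = [[("A", -1), ("B", -1), ("A", 10)],
            [("B", -1), ("A", 10)]]
print(mc_evaluate(episodes, first_visit=True))
print(mc_evaluate(episodes, first_visit=False))
```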
I decided to run the algorithm(s) with an increasing number of episodes, so it is possible to see how the gridworld evolves. The value of
For the heat-maps and the statistics, see the report.
The implementation of the SARSA and Q-Learning agents is summarized in the diagram in the figure (see report). An abstract class contains all the methods and parameters common to both algorithms, and the two implementations follow exactly the pseudo-code seen during the lectures. As for Monte-Carlo policy evaluation, both Q-Learning and SARSA are run with an increasing number of episodes. To obtain more precise results, for each number of episodes the algorithm is repeated 1000 times.
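A minimal sketch of this class layout, assuming illustrative names (`TDAgent`, `SarsaAgent`, `QLearningAgent`) rather than the ones in the notebook's diagram; terminal-state handling and the training loop are omitted, and the two subclasses differ only in how the update target is computed:

```python
import random
from abc import ABC, abstractmethod
from collections import defaultdict

class TDAgent(ABC):
    """Shared machinery: Q-table, epsilon-greedy selection, hyper-parameters."""
    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.Q = defaultdict(float)          # (state, action) -> value
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(state, a)])

    @abstractmethod
    def update(self, s, a, r, s_next, a_next):
        ...

class SarsaAgent(TDAgent):
    def update(self, s, a, r, s_next, a_next):
        # on-policy: the target uses the action actually taken in s_next
        target = r + self.gamma * self.Q[(s_next, a_next)]
        self.Q[(s, a)] += self.alpha * (target - self.Q[(s, a)])

class QLearningAgent(TDAgent):
    def update(self, s, a, r, s_next, a_next=None):
        # off-policy: the target uses the greedy action, whatever was taken
        target = r + self.gamma * max(self.Q[(s_next, b)] for b in self.actions)
        self.Q[(s, a)] += self.alpha * (target - self.Q[(s, a)])
```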
Starting with 10 episodes, this experiment aims to show how the quality of the results improves as the number of episodes increases. All the experiments were carried out using a learning rate
To obtain balanced data, in each episode the agent starts from the top-left cell (1,1) and acts until it reaches a terminal cell (Pit or Goal). Figures 9-13 show the expected action-value functions of the environment computed by the algorithm after a given number of episodes (excluding walls, pits, and goals). The figures show the average results over all the episodes and the 1000 repetitions of each experiment.
The results are computed as for SARSA, using the same parameters and numbers of episodes. Looking at figures 14-18, we can easily notice that this algorithm reaches an acceptable policy much faster. Indeed, with 100 episodes the Q-Learning agent is already able to draw a pattern on the gridworld, although the average total reward is still negative. Moreover, in 300 episodes this agent reaches the total reward that the SARSA agent obtains in 500 episodes. As expected, the Q-Learning algorithm converges faster than SARSA. The last figure of the Q-Learning results shows a more ``convinced'' path from the starting cell to the Goal.
The two algorithms are very similar in terms of execution time. The key difference between them is shown in figures 19-20. These gridworlds represent the result of an ``exaggerated'' experiment run with