Check the file 6_LunarLander_PolicyBased,
which implements the cross-entropy method (CEM) on the LunarLander environment.
This implementation is incomplete (and does not work as is). To fix it, complete the following tasks:
- Currently the state-action pairs of all episodes in a batch are used for training. To implement CEM correctly, use only the state-action pairs of the twenty episodes with the highest total reward in each batch.
- Train the network with the correction from the first task.
- After training, test the agent by running one episode with the trained network and record that episode (see the notebooks from lab 5 for recording code templates).
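The elite-selection step of the first task can be sketched as follows. This is a minimal example, not the notebook's actual code: it assumes each episode is stored as a tuple of (total reward, states, actions), and the function name select_elite and the batch layout are hypothetical.

```python
import numpy as np

def select_elite(batch, n_elite=20):
    """Keep only the state-action pairs of the n_elite highest-reward episodes."""
    rewards = np.array([ep[0] for ep in batch])
    elite_idx = rewards.argsort()[-n_elite:]  # indices of the best episodes
    elite_states, elite_actions = [], []
    for i in elite_idx:
        _, states, actions = batch[i]
        elite_states.extend(states)   # flatten elite episodes into one
        elite_actions.extend(actions)  # training set of state-action pairs
    return elite_states, elite_actions

# Tiny usage example with dummy one-step episodes (reward r, state [r, 0.0], action r % 4):
batch = [(float(r), [[r, 0.0]], [r % 4]) for r in range(30)]
states, actions = select_elite(batch, n_elite=20)
print(len(states))  # 20 elite episodes, one step each -> 20 state-action pairs
```

Only the returned state-action pairs are then fed to the training step; the remaining low-reward episodes in the batch are discarded.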