Are there any baselines/benchmarks for continuous action versions of environments? #128

Chulabhaya opened this issue Jan 6, 2025 · 6 comments

@Chulabhaya (Contributor)

Hi all! I noticed that some of the environments like SMAX and MPE offer continuous action versions of the environments. I put together a continuous action MAPPO based on the existing discrete action MAPPO, but I have no way to tell whether it's achieving expected performance or not. Do you guys happen to have any internal plots of curves showing learning performance on continuous action versions of SMAX and MPE? Thanks in advance!

@amacrutherford (Collaborator)

Hey! How does your performance compare to the discrete implementations? I ran these a while ago so I don't have the plots to hand.

@Chulabhaya (Contributor, Author)

I've only tried two environments so far: on MPE Simple Spread it gets pretty close, but on SMAX 2s3z it doesn't learn much at all. However, I realized I had a bug in my implementation: I forgot to ensure that actions sampled from the Gaussian fall within the [0, 1] action range of MPE and SMAX. How did you handle that in your original testing? Clipping the sampled actions to min 0, max 1, using a squashed tanh, etc.?
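
For concreteness, here's a minimal sketch of the two options I'm weighing, assuming a diagonal Gaussian whose `mean` and `log_std` come from the actor network (the helper names are mine, not from the codebase):

```python
import jax
import jax.numpy as jnp

def gaussian_log_prob(raw, mean, log_std):
    # Diagonal-Gaussian log-density, summed over action dims.
    std = jnp.exp(log_std)
    return jnp.sum(
        -0.5 * ((raw - mean) / std) ** 2 - log_std - 0.5 * jnp.log(2 * jnp.pi),
        axis=-1,
    )

def sample_clipped(key, mean, log_std):
    # Option 1: sample, then clip into [0, 1]. The log-prob is taken on the
    # unclipped sample, so the gradient is slightly biased at the boundary,
    # but this is the simplest fix.
    raw = mean + jnp.exp(log_std) * jax.random.normal(key, mean.shape)
    return jnp.clip(raw, 0.0, 1.0), gaussian_log_prob(raw, mean, log_std)

def sample_tanh_squashed(key, mean, log_std):
    # Option 2: squash with tanh and rescale [-1, 1] -> [0, 1], applying the
    # change-of-variables correction (SAC-style squashed Gaussian).
    raw = mean + jnp.exp(log_std) * jax.random.normal(key, mean.shape)
    squashed = jnp.tanh(raw)
    action = 0.5 * (squashed + 1.0)  # now in [0, 1]
    # |d action / d raw| = 0.5 * (1 - tanh(raw)^2)
    log_prob = gaussian_log_prob(raw, mean, log_std) - jnp.sum(
        jnp.log(0.5 * (1.0 - squashed**2) + 1e-6), axis=-1
    )
    return action, log_prob
```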

@Chulabhaya (Contributor, Author)

So with clipping implemented for actions, I'm seeing a ~20-25% win rate on SMAX 2s3z after 10 million timesteps. This is obviously significantly worse than the discrete version, which completely solves that environment within ~2-3 million timesteps. However, the continuous action space does also make the problem harder; do these results seem on par, or does something seem off?

Here's my implementation of continuous MAPPO: https://pastebin.com/VP0yKY9W
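
The main change from the discrete version is the actor's output head: instead of a Categorical over discrete actions, the network emits the parameters of a diagonal Gaussian. Schematically it looks like this (class and parameter names are illustrative, not the exact ones in the pastebin):

```python
import distrax
import flax.linen as nn
import jax.numpy as jnp

class ContinuousActorHead(nn.Module):
    # Illustrative Gaussian head standing in for the discrete Categorical
    # head; action_dim is the env's continuous action dimension.
    action_dim: int

    @nn.compact
    def __call__(self, embedding):
        mean = nn.Dense(
            self.action_dim, kernel_init=nn.initializers.orthogonal(0.01)
        )(embedding)
        # State-independent log-std, as in most PPO implementations.
        log_std = self.param("log_std", nn.initializers.zeros, (self.action_dim,))
        return distrax.MultivariateNormalDiag(
            loc=mean, scale_diag=jnp.exp(log_std)
        )
```

Samples from this distribution then get clipped into [0, 1] as discussed above.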

I'm using the following params to get to that ~20-25%; these were initially copied from the discrete MAPPO SMAX config:

"LR": 0.0002
"NUM_ENVS": 128
"NUM_STEPS": 128 
"TOTAL_TIMESTEPS": 1e7
"FC_DIM_SIZE": 128
"GRU_HIDDEN_DIM": 128
"UPDATE_EPOCHS": 4
"NUM_MINIBATCHES": 4
"GAMMA": 0.99
"GAE_LAMBDA": 0.95
"CLIP_EPS": 0.2
"SCALE_CLIP_EPS": False
"ENT_COEF": 0.0
"VF_COEF": 0.5
"MAX_GRAD_NORM": 0.25
"ACTIVATION": "relu"
"OBS_WITH_AGENT_ID": True
"ENV_NAME": "HeuristicEnemySMAX"
"MAP_NAME": "2s3z"
"SEED": 0
"ENV_KWARGS": 
  "see_enemy_actions": True
  "walls_cause_death": True
  "attack_mode": "closest"
  "action_type": "continuous"
"ANNEAL_LR": False

@amacrutherford (Collaborator)

So you can actually look here for a continuous action implementation of IPPO: https://github.com/amacrutherford/sampling-for-learnability/blob/main/sfl/train/jaxnav_sfl.py

MPE should be pretty much the same between the two implementations. Those SMAX results seem on par with what we found :)

@Chulabhaya (Contributor, Author)

Awesome, sounds great! And thank you for the IPPO implementation; I will definitely take a look at that. Would you be interested in a PR adding continuous action MAPPO for review?

amacrutherford self-assigned this on Jan 20, 2025
@amacrutherford (Collaborator)

Yes please, that would be fab! Please include a training curve in the PR contrasting the performance of the continuous action implementation with the discrete one for MPE :)
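
Something along these lines would do; the file names and the logged metric are placeholders for whatever your training script actually saves:

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder arrays: one mean-return value per PPO update, saved however
# your training script logs metrics.
discrete = np.load("mpe_discrete_returns.npy")
continuous = np.load("mpe_continuous_returns.npy")

steps_per_update = 128 * 128  # NUM_ENVS * NUM_STEPS from the config above
plt.plot(np.arange(len(discrete)) * steps_per_update, discrete,
         label="discrete MAPPO")
plt.plot(np.arange(len(continuous)) * steps_per_update, continuous,
         label="continuous MAPPO")
plt.xlabel("environment steps")
plt.ylabel("mean episodic return")
plt.title("MPE Simple Spread: discrete vs. continuous MAPPO")
plt.legend()
plt.savefig("mpe_discrete_vs_continuous.png", dpi=150)
```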
