
[Feature] Gymnasium 1.0 compatibility #2473

Open
wants to merge 2 commits into base: gh/vmoens/31/base

Conversation

vmoens
Contributor

@vmoens vmoens commented Oct 9, 2024

Stack from ghstack (oldest at bottom):

[ghstack-poisoned]

pytorch-bot bot commented Oct 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2473

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 13 Unrelated Failures

As of commit 33c429b with merge base fac1f7a:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vmoens added a commit that referenced this pull request Oct 9, 2024
ghstack-source-id: 305b15271bd5b0cd7d9b55a9f8b2079bbc40950f
Pull Request resolved: #2473
@facebook-github-bot facebook-github-bot added the CLA Signed label Oct 9, 2024
@vmoens
Contributor Author

vmoens commented Oct 9, 2024

@pseudo-rnd-thoughts @jkterry1 I would need some guidance here for the partial resets.

If I understand correctly, now if an env is done, step returns the next obs, and the next time step is called the action will be ignored and the reset obs will be returned.
In torchrl, what we do is (simplified with a single env)

def step_and_maybe_reset(self, data):
    next_obs, rew, term, trunc, info = self._env.step(data["action"])
    if term | trunc:
        # the env is done: reset it right away
        reset_obs = self._env.reset()
    else:
        reset_obs = next_obs
    # "next" carries the transition that was just collected
    data.set("next", TensorDict(obs=next_obs, reward=rew, ...))
    # the second output is what the policy will see at the next step
    maybe_reset = data.empty().set(obs=reset_obs, ...)
    return data, maybe_reset

I'm trying to think about how I could adapt step_and_maybe_reset, but it isn't easy since I would need to call self._env.step with a new action to get reset_obs.

I cannot think of any way to make step_and_maybe_reset behave correctly, since I need the action at t+1 (for envs that should not reset) - am I correct?

Other more metaphysical questions:

  • what effect does the action have when passed to step when step means reset?
  • Assuming the answer is "it has no effect": If it's expensive to compute the action and many envs are resetting (or envs are resetting frequently), isn't this a loss of resources?
  • What values will reward, termination and truncation have when they come from a reset?
  • In the example below, we're waiting one step to read what autoreset is doing
    replay_buffer = []
    obs, _ = envs.reset()
    autoreset = np.zeros(envs.num_envs)
    for _ in range(total_timesteps):
        next_obs, rewards, terminations, truncations, _ = envs.step(envs.action_space.sample())
    
        for j in range(envs.num_envs):
            if not autoreset[j]:
                replay_buffer.append((
                    obs[j], rewards[j], terminations[j], truncations[j], next_obs[j]
                ))
    
        obs = next_obs
        autoreset = np.logical_or(terminations, truncations)
    Am I correct to think that previously, if I was looping N times I would get N steps, now I get anything from N//2 to N steps? Isn't that a problem if I'm designing a training loop where I'm giving myself a budget of N steps?
  • I now see some control flow appearing in the collection loop. Isn't the whole point of vectorized envs to get rid of that? You could tell me that this is a simplified version and that the obs / rewards etc. could be masked, I guess... but why not adopt a loop like this, where users can just extend their buffers with the peace of mind of not having to copy error-prone boilerplate code:
    for _ in range(N):
        action = policy(obs)
        next_obs, maybe_reset_obs, rewards, term, trunc, info = env.step(action)
        buffer.extend((obs, action, next_obs, rewards, term, trunc, info))
        obs = maybe_reset_obs

Thanks!

[ghstack-poisoned]
@vmoens vmoens added the Environments label Oct 9, 2024
vmoens added a commit that referenced this pull request Oct 9, 2024
ghstack-source-id: eed26cf693a936c60ee7607ce6b8e5fd6b994953
Pull Request resolved: #2473
@pseudo-rnd-thoughts

@vmoens Thanks for your message, I'll try to answer the best I can

what effect does the action have when passed to step when step means reset?

The action is discarded or "has no effect"

Assuming the answer is "it has no effect": If it's expensive to compute the action and many envs are resetting (or envs are resetting frequently), isn't this a loss of resources?

This adds an extra step to vector environments when resetting so this cost is dependent on the reset frequency. Yes, this is a loss of resources, however, in my opinion, very minimally.

What values will reward, termination and truncation have when they come from a reset?

The reward will be zero, and termination and truncation will be False.

Isn't that a problem if I'm designing a training loop where I'm giving myself a budget of N steps?

Possibly; if you reset every step, this reduces your practical budget to N//2, but in practice I would imagine this to be very rare. Admittedly, how you would specify that you are learning on precisely N steps is less clear now.

control flow

You are correct, but the boilerplate assumes that users have a single-environment buffer, so each sub-environment's data must be extracted. For vectorised environment buffers this control flow is unnecessary, but it is critical that you do not use the end of one episode and the beginning of another in your training data (or it must be masked out).
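
A rough sketch of what that masking could look like in code (purely illustrative; the buffer here is a plain list and store_batch is a hypothetical helper, not Gymnasium API):

import numpy as np

def store_batch(buffer, obs, actions, rewards, terminations, truncations, next_obs, autoreset):
    # autoreset: boolean array, True where the previous step was a reset step.
    # Those transitions carry no learning signal, so tag them with a validity
    # flag instead of dropping them from the batch.
    valid = ~autoreset
    buffer.append((obs, actions, rewards, terminations, truncations, next_obs, valid))
    # Mask for the next call: sub-envs that just ended will autoreset next step.
    return np.logical_or(terminations, truncations)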

@RedTachyon and I are planning to release a blog post about this, because we understand the complexity it creates for projects to adapt.

@vmoens
Contributor Author

vmoens commented Oct 9, 2024

What about an approach that does not need a blog post to adapt? :D
Like the one I proposed above where you always get a maybe_reset_obs?
That would not consume more memory or compute than what you have now, I think, and it would be easier to integrate.

@vmoens
Contributor Author

vmoens commented Oct 9, 2024

@pseudo-rnd-thoughts you don't have partial resets, right? Something like reset(mask)?

@matteobettini
Contributor

matteobettini commented Oct 9, 2024

@pseudo-rnd-thoughts I'll drop my 2 cents here as the creator of VMAS, which is a vectorized env.

It seems to me that the new API for autoresets in step is quite problematic from multiple points of view:

  1. it will cause a significant increase in the number of step calls, because step will have to be called once more for every reset
  2. it will cause a significant increase in the number of calls to the policy, which has to be called even if 99% of the envs are done, just to provide the valid 1% of actions (keep in mind that the policy could be an expensive LLM and we are throwing away up to 99% of the output)

I think the impact of this is bigger than we can imagine. I work with batches of hundreds of thousands of envs on GPU, and it is very probable that in one step at least one env will be done.

I don't know why this is the route that was taken, but I don't think it will be sustainable.

The solution is really simple in my opinion: keep step just doing steps on all envs and have something like env.reset(done), which resets using a mask (or index or anything). This mask input could be optional, and by default env.reset() would do a full reset.
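
Something along these lines (a purely hypothetical sketch, not an existing Gymnasium signature; _reset_indices is an assumed backend hook):

import numpy as np

class MaskedResetVectorEnv:
    def reset(self, done_mask=None):
        # None means "reset everything", i.e. today's behaviour;
        # otherwise only the sub-envs selected by the boolean mask are reset.
        if done_mask is None:
            done_mask = np.ones(self.num_envs, dtype=bool)
        return self._reset_indices(done_mask)  # assumed backend hook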

@vmoens
Contributor Author

vmoens commented Oct 9, 2024

Since it's an API that is also used by envpool and others, we could think of a way to circumvent this when calling step_and_maybe_reset, but as of now I think the most immediate solution is going to be to tell people not to use gymnasium 1.0 with torchrl until a fix is found...

I agree with Matteo that having a method called step that under the hood ignores the input, provides garbage output and actually does something different than step looks like a misstep. That being said, I assume it's a decision from the gymnasium community of users, in which case we can only acknowledge it and work around that...

To unblock this, I can give a shot at building a wrapper that does some sort of partial reset maybe? Thoughts?

@pseudo-rnd-thoughts

pseudo-rnd-thoughts commented Oct 9, 2024

@vmoens and @matteobettini Thanks for your input, I'm happy to have a zoom call to work out a solution if that helps.

To clarify, the previous autoreset (reset within the same step) is still compatible with gymnasium.vector.VectorEnv; it is just that the vectorisation approaches implemented within Gymnasium do not use it, and it is not compatible with the gymnasium.vector.wrappers. It might be possible to add support for both APIs, but it won't be easy.

you don't have partial resets right? Smth like reset(mask)?

Normally, for a vector environment from Gymnasium, you only call reset at the beginning; users then call step, which autoresets sub-environments that terminate or truncate, so it wouldn't make sense to have a reset(mask) under this API, if I understand correctly.

@matteobettini I admittedly didn't test this across a wide range of environments, but I didn't expect that adding an additional call to a policy would cause significant problems, as most environments take at least a few hundred steps before termination, let alone truncation. So I expected this to have only a minimal impact on performance.

having a method call step that under the hood ignores the input, provides garbage output and actually does something different than step looks like a misstep.

The problem as I see it is that an autoreset must happen under the hood to reset sub-environments, either within the same step as a termination / truncation or on the next step.

@matteobettini
Contributor

matteobettini commented Oct 9, 2024

Thanks for the reply, I am available if we set up a call

The problem as I see it is that an autoreset must happen under the hood to reset sub-environments, either within the same step as a termination / truncation or on the next step.

Yes, I think the core problem discussed here is that doing it in the same step is a better solution, as distributing it across 2 steps has the problems highlighted above.

In your example of the 200 steps: if you have 200 environments terminating in 200 steps and they are perfectly out of sync, you will be doing a reset at every step. Imagine with more environments... 200 is nothing compared to 200_000.

More generally regarding autoreset, I am a bit constrained by it. I understand that this is how things are done in gym currently, but for example in my use case I have a VMAS env (which has no autoreset) that I would like to wrap in a gymnasium VectorEnv, and I can't because gymnasium assumes that everyone will use autoresets. I would prefer the general API not to prescribe whether you do autoresets or not (and thus provide a reset function that takes an optional mask).
Users can then implement it using autoresets and env.reset(), or using normal steps and env.reset(done). This is more my own struggle with the API and not super related to the problem discussed here, although it is a solution for it.

@vmoens
Contributor Author

vmoens commented Oct 9, 2024

@pseudo-rnd-thoughts is there any way one could implement a reset(done) like @matteobettini was suggesting, or anything where we don't have to deal with garbage data?

@pseudo-rnd-thoughts

The vector API doesn't require autoreset to be implemented but it has been designed with this in mind.
You could pass this data through reset(options={"mask": ...}) as well.
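
For example (a sketch only, assuming the terminations / truncations arrays from the last step call; how the env interprets the "mask" key would be entirely up to the env implementation):

import numpy as np

done = np.logical_or(terminations, truncations)  # sub-envs that should restart
obs, info = envs.reset(options={"mask": done})   # mask passed through the standard options dict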

@matteobettini
Contributor

matteobettini commented Oct 9, 2024

To get back to the issue at hand, here is a recap:

It seems like in torchrl we are not able to support resets happening over 2 step calls. So in order to work with the new gymnasium VectorEnv we would need either:

  • a way to disable autoresets and call env.reset() with a mask of what to reset
  • a way to go back to autoresets happening in one step (previous version)

@pseudo-rnd-thoughts is there any way we can make either happen?

(my fav would be the first)

@matteobettini
Contributor

matteobettini commented Oct 9, 2024

I think I spotted another critical issue related to this.

The API assumes that it is possible to step the sub-envs independently (which is not true in batched simulators, as it would defeat their purpose).

Here is the example:

  • I have 4 envs: [0,1,2,3]
  • I called step and 2 and 3 are done
  • now when I call step again, 2 and 3 will be reset
  • but then if my simulator uses a batch of envs (and not a list of processes), I cannot step dimensions 0 and 1 without stepping also 2 and 3
  • so my step will step all envs and I will have to consider the actions and the new data for 2 and 3 as well

So for all environments that cannot step sub-envs independently, the new API cannot work.

@RedTachyon

I might address some other points later (discussing with Vincent "offline" in parallel), but one thing I want a sanity check on regarding the last message:

So for all environemnts that cannot step subenvs independently the new API cannot work

Wouldn't the same be true for the old API? (which also featured an autoreset, just shifted by one step)

@matteobettini
Contributor

matteobettini commented Oct 9, 2024

Wouldn't the same be true for the old API? (which also featured an autoreset, just shifted by one step)

No, the old API should be fine, because it did (correct me if I am wrong):

When step is called:

  • step all envs
  • reset the done ones
  • store the data from the step and reset in 2 different places and output both

So it always has a global step
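
For reference, a rough sketch of how that same-step autoreset was typically consumed (assuming an envs vector env, a batch of actions, and the pre-1.0 final_observation / final_info info keys):

import numpy as np

obs, rewards, terminations, truncations, infos = envs.step(actions)
done = np.logical_or(terminations, truncations)
if done.any():
    # obs already holds the reset observation for the finished sub-envs;
    # the true last observation of each finished episode is recovered from infos.
    final_obs = infos["final_observation"]  # object array, None where not done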

I would hardly call the new API a "shift"; I think its implications are quite far-reaching.

@pseudo-rnd-thoughts

@matteobettini I don't know how your problem is implemented, but can you not pass a no-op-like action to the simulator?
Otherwise, it would seem that you can still use Gymnasium v1.0, but your vector environments would use a different autoreset API than the default. I can look to see what we can do to support both APIs, but there are limitations.

@matteobettini
Contributor

matteobettini commented Oct 9, 2024

@matteobettini I don't know how your problem is implemented, but can you not pass a no-op-like action to the simulator? Otherwise, it would seem that you can still use Gymnasium v1.0, but your vector environments would use a different autoreset API than the default. I can look to see what we can do to support both APIs, but there are limitations.

I am not talking about my problem here, I am talking about any batched simulation framework (isaac, vmas, brax...) or any batched policy, where the simulator step is some function of the input $o'=f(\pi(o))$, and $o,o'$ have n_envs as their leading dimension.

The core point of batching and vectorization in RL is that the functions $f$ and $\pi$ can run independently of n_envs, and scaling in n_envs makes them more efficient.

What the new API implies is that you will have to compute $\pi$ and $f$ over partial data (partial steps and partial actions when a reset happens).

This is a major problem because this API is essentially eliminating all the benefits of batching and vectorization

    1. the benefits of batching in $\pi$ (which could previously be achieved even without a batched env) are undone because some actions will be ignored
    2. the benefits of batching in $f$ (which were enabled by GPU-accelerated envs) are undone because gymnasium requires partial steps

Point 2 is only visible when using batched simulators (e.g., isaac, brax, vmas; assuming they even support partial steps), but point 1 will be visible in any training library based on this API, independently of the simulator.

@RedTachyon

I'd really, really like to ask people to turn down the temperature and stop overexaggerating.

This is a major problem because this API is essentially eliminating all the benefits of batching and vectorization

This is just not true. I know this for a fact, because I'm working on a project that uses the new API and benefits from batching and vectorization. Beyond that, Gymnasium ships with some hardware-accelerated Jax-based environments.

There's probably a more targeted argument that you can make - in which case, please make that argument instead of saying false things, because there's no point for us to engage with that.

@matteobettini
Contributor

matteobettini commented Oct 9, 2024

I did not want to step up the tone, sorry for that, I am just trying to bring this issue to light.

Let me try to prove that there is a case where you would lose all the benefits of batching, with a worst-case scenario study.

Assume you have $N$ environments.

The worst-case scenario happens when all environments are done except 1.

You would then call the policy with $N$ obs, but $N-1$ of the outputs are discarded.

Then your step function would call the physical step with 1 action and $N-1$ no-op actions.

This is why I believe my statement is true.

This is because the reset happens after the policy and before the step. In the previous API, the reset happened after the step but before the policy, making all calls to the physical step and the policy use $N$ valid inputs.

I know that this is a worst-case scenario, but computer science often studies complexity using worst-case analysis.

I understand that empirically you use jax and are doing fine, but I am just trying to understand the theoretical implications here.
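
To put a rough number on it (a back-of-the-envelope estimate, not a formal claim): with $N$ envs and an average episode length of $T$ steps, for each sub-env roughly one step call in $T+1$ is a reset call, so the useful transitions collected per step call are about

$$N \cdot \frac{T}{T+1}$$

For long episodes this tends to $N$, but for $T = 1$ it drops to $N/2$, which matches the N//2 figure mentioned earlier in the thread.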

@pseudo-rnd-thoughts

Maybe I'm too much of an empiricist, but to me, this worst-case analysis should be a minimal problem in practice, so I didn't take it much into consideration. If you find this significantly increases compute time, please let me know.

any paper using gymnasium>=1.0 and saying "we trained this algorithm for 1M steps" is lying to you

@vmoens Ironically, the opposite is true (to my knowledge) with the no-op action for terminated / truncated states; then 1 million steps is an upper bound, not a lower bound. In reality, they are probably taking fewer actual environment steps than that, unless they try to be fancy, in which case yes, they could mess it up, but that is their fault, not gymnasium's, imho.

@vmoens
Contributor Author

vmoens commented Oct 9, 2024

@RedTachyon I agree, let's lower the tone! At the end of the day, what we want is to support the community better.

Let me first try to explain why it is so difficult for us to adapt to these changes:

Concretely speaking, this change is impossible to incorporate in torchrl without some heavy, painful and probably bc-breaking refactoring. To put it simply: env.step_and_maybe_reset is our main entry point for doing a step and reset combined. It looks at the done states and, if needed, runs a reset in the env. If the env is auto-resetting, it fetches the final obs from somewhere (previously the info dict) and stores it at its appropriate location. This function returns two objects: one which is the input data updated with the content of the current traj, and another that is either the next step's base data or the reset data. You can use the second object to feed your policy.

Our goal is to support the RL community, whether they use gym or something else, and here the cost seems too high for the general community to pay. I understand gym wants to be the one default backend for simulation in RL and related fields, but from where we stand we can't ignore other frameworks.
Unfortunately, any solution that requires users to run the policy on observations that are out of the trajectory and throw away the actions manually isn't compatible with our current API and probably won't be. It would introduce some heavy control flow in training scripts / trainers / buffers and make the library less modular (because one component like the buffer would depend on the backend of the env, which is surprising).
With the previous API, we could fit the gym vectorized envs within a wrapper and all the actions were always valid, so there was no need for the users to worry about the backend they were using.
We want to allow people to code environments directly with TorchRL: the user base is not as big as Gym's, I admit, and I also understand that whether your API matches anyone else's should not be a worry for you. But as it is now, the change is too big and too abrupt to be incorporated, and, like you, we have very limited bandwidth to refactor things within the library.
I hope that makes sense!

To understand why we don't encounter the problem of reset vs final obs in torchrl, I invite you to read about torchrl format here. I hope this will help you understand where we come from!

To move forward:

  • Is there a way to detect whether an env is autoresetting with the new API? If so, we could just error when such an env is detected by GymWrapper and not in other cases. We wouldn't want to have to prohibit users from using gymnasium 1.0 across the whole lib!
  • If we can help in the design of new gym wrappers or methods that permit a partial reset, or a fallback to the previous API (which we did not like much but at least managed to make work), we'd be happy to dedicate time to that.

@matteobettini
Contributor

Maybe I'm too much of an empiricist, but to me, this worst-case analysis should be a minimal problem in practice, so I didn't take it much into consideration. If you find this significantly increases compute time, please let me know.

I believe it could show up in practical examples. Imagine a self-driving simulator where you have 4 actions for driving and one for a handbrake. Pulling the handbrake at the wrong time would probably lead to a crash. I believe that a scenario like this would come close to what I outlined, as a lot of sub-envs would frequently be done during exploration, and you would see the difference empirically.

I guess the point I am making is that the cost of the previous API did not depend on the task, whereas now it does, with the worst case being what I outlined and a scenario like the above empirically approaching it.

@vmoens
Contributor Author

vmoens commented Oct 11, 2024

@pseudo-rnd-thoughts

Let me rephrase and expand a bit on the accuracy point. Here's another simple example of why I think this is an issue:

CartPole (arguably the most popular and simple env ever) has a reset frequency tightly linked to the perf of the algo.
Say I allocate myself a 1M-step budget for training. Now take algo 1, which is not good, giving me many resets, and algo 2, much more efficient, giving me very few resets.

With gym 1.0, algo 1 will get fewer useful steps out of the budget, so probably an even poorer perf. The gap between 1 and 2 will be bigger than it should be.
Therefore something irrelevant (the simulator API) will impact the measure of how good an algo is.

Note that the effect is the opposite if the env resets more frequently when it's solved (imagine getting out of a maze as fast as possible => more success = more resets = less training data).

Does that make sense?
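
To illustrate with made-up numbers (just an illustration, not measured data): with a 1M-call budget, an algo whose episodes last 20 steps on average spends about 1/21 of its calls on resets and effectively trains on roughly 952k transitions, while an algo with 200-step episodes keeps roughly 995k. That difference comes from the API, not from the algorithms.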

To move forward:

  • Is there a way to detect whether an env is autoresetting with the new API? If so, we could just error when such an env is detected by GymWrapper and not in other cases. We wouldn't want to have to prohibit users from using gymnasium 1.0 across the whole lib!
  • If we can help in the design of new gym wrappers or methods that permit a partial reset, or a fallback to the previous API (which we did not like much but at least managed to make work), we'd be happy to dedicate time to that.

Any thoughts on this?
