Detachment of env and proxy; avoiding copies of the environments #299
Conversation
…ultiple instances of the env in the agent instead of copying it
…ewards() via a log parameter; fixes
…roxy config files
…ts are WIP and everything needs further testing.
…eward instead of in rewards()
…proxy2reward and beta values in configs but as proxy config
Ok, to me this is now ready to merge. @carriepl @AlexandraVolokhova I'm tagging you as reviewers to let you know, in case you want to take a quick look. Feel free to go ahead and approve and merge unless you have any concerns. Sanity checks look perfect: https://wandb.ai/alexhg/gfn_sanity_checks (the issue with FL was that the sanity checks config file was incorrect for the new proxy - fixed in 361cdc0)
@ginihumer I am tagging you here because this will be merged soon into the main branch and then into the activelearning branch, which will force making changes (for the better) on the active learning repo. No need to check the code details, but it's worth at least reading the description to understand what will change. In a nutshell: proxies no longer need to be negative with lower values being better; they can take any range of values, and the conversion to rewards can be controlled by the proxy configuration.
There's a huge amount of work in this PR! Overall, it looks pretty good to me.
gflownet/gflownet.py (Outdated)
format.
"""
samples_final = []
max_reward = self.proxy.proxy2reward(self.proxy.min)
This line is mysterious to me. Not sure if I'll understand it later when I read the remaining changes so... just leaving a comment here as a reminder.
Oh dear, good catch! This is pretty wrong and it's gonna need a bunch of changes that will delay things a bit...
How about this? a1462f1
@carriepl Thank you so much for the review! Would you mind taking a quick look at my comments and resolving yours where they have been addressed? There should be only one remaining issue, if I am not mistaken.
I just looked. It all looks great. Only that one issue remaining.
Yes this is nice! I use it more and more, for example in the tests of the new Crystal subenvs.
On 21 May 2024 12:59:01 GMT-04:00, @carriepl commented on this pull request:
+ 1.0,
+ 8.1031e03,
+ ],
+ [-11, -2, -1.5, -1.1, -1, -0.9, -0.5, 0, 9],
+ ),
+ ],
+)
+def test_reward_function_callable__behaves_as_expected(
+ proxy_callable,
+ reward_function,
+ logreward_function,
+ proxy_values,
+ rewards_exp,
+ logrewards_exp,
+):
+ proxy = proxy_callable
Ok, I wasn't aware that Pytest fixtures could implicitly use the test parameters. Good to know!
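For reference, here is a generic, minimal sketch of pytest indirect parametrization, one mechanism by which a fixture can pick up a per-test parameter via request.param; the fixture and test names are illustrative, and the actual gflownet tests may use a different mechanism.

```python
import pytest

# Generic illustration (not the exact setup of the gflownet tests): with
# indirect=True, each parametrized value is routed through the fixture,
# which receives it as request.param.
@pytest.fixture
def proxy(request):
    return request.param

@pytest.mark.parametrize("proxy", [min, max], indirect=True)
def test_proxy_is_callable(proxy):
    assert callable(proxy)
```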
Looks good to me. Many thanks!
This PR refactors several important aspects of the code base:
Copies of environments
Each trajectory in a GFlowNet batch is constructed by evolving one environment, either forwards or backwards. Until now, a reference environment was copied and reset to construct each trajectory. This PR replaces the copying operation: the agent now stores an "environment maker" as an attribute (conceptually, just the environment class) and creates a new environment instance for each trajectory, as sketched below.
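A minimal sketch of the "environment maker" pattern; the class and argument names below are illustrative placeholders, not the actual gflownet API.

```python
from functools import partial

# Toy environment standing in for a GFlowNet environment.
class GridEnv:
    def __init__(self, length=10):
        self.length = length
        self.state = None

    def reset(self):
        self.state = [0] * self.length
        return self

class GFlowNetAgent:
    def __init__(self, env_maker):
        # A callable that builds a fresh environment, replacing the old
        # pattern of copying and resetting a reference environment.
        self.env_maker = env_maker

    def new_env(self):
        return self.env_maker().reset()

agent = GFlowNetAgent(env_maker=partial(GridEnv, length=8))
env_a, env_b = agent.new_env(), agent.new_env()
assert env_a is not env_b  # each trajectory evolves its own instance
```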
Detachment of environment and proxy
This addresses Issue #288.
So far, the proxy has been part of GFlowNet environments as an attribute. The rationale was to be able to compute the reward of a trajectory directly from the environment, for example with
env.reward()
. While this is a nice feature, it comes at a cost: it has prevented extending the flexibility to use different proxy-to-reward functions, handling the computation of log-rewards in a more numerically stable way, and simply using alternative baselines to GFlowNets. In this PR, the proxy and the environments have been detached and become (almost fully) independent of each other. Together with this change, I have also tried to improve the implementation that handles the conversion of proxy values into rewards or log-rewards, to make it more efficient, numerically more stable and easier to extend.
These are some of the new features: the conversion from proxy values to rewards and log-rewards is now configurable via self.reward_function (see the sketch below).
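A minimal sketch of the detached workflow, assuming illustrative names (make_reward_functions is hypothetical, not the gflownet API): proxy outputs are converted into rewards and log-rewards through a configurable proxy-to-reward callable, without going through the environment.

```python
import torch

def make_reward_functions(name, beta=1.0):
    """Return (reward_fn, logreward_fn) for a named proxy-to-reward mapping."""
    if name == "identity":
        return lambda x: x, lambda x: torch.log(x)
    if name == "exponential":
        # Log-rewards are computed directly, avoiding an exp/log round trip.
        return lambda x: torch.exp(beta * x), lambda x: beta * x
    raise ValueError(f"Unknown reward function: {name}")

# Usage: proxy values -> rewards, with no call to env.reward().
proxy_values = torch.tensor([-2.0, -1.0, 0.0, 1.0])
reward_fn, logreward_fn = make_reward_functions("exponential", beta=1.0)
rewards = reward_fn(proxy_values)
logrewards = logreward_fn(proxy_values)
```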
Dropping the convention of range of values in the proxies
So far, we (sort of) followed a convention by which lower (and negative) proxy values were better. This has created a lot of confusion because not all proxies make much sense under this convention.
With this PR, the convention is dropped altogether. The guiding principle now is that the user is free to create any proxy, with any range of values and any direction (maximisation or minimisation), but they are responsible for using an appropriate proxy-to-reward function, as in the sketch below. This is now much easier given the changes described in the previous section.
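For example (a hypothetical proxy, not from the code base), an "energy" proxy where lower values are better can be mapped to rewards with a negative exponential, so that lower energies yield higher rewards.

```python
import torch

beta = 1.0
energy = torch.tensor([-11.0, -2.0, 0.0, 9.0])  # any sign, any range
rewards = torch.exp(-beta * energy)             # lower energy -> higher reward
logrewards = -beta * energy                     # log-rewards, numerically stable
```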
To-do list
Some of these things may be left for a separate PR.
- Update the methods (sample_from_reward(), compute_train_energy_proxy_and_rewards(), ...) that still use the old methods in the base env.
- Remove self.reward_norm and co. in the base env.
- Avoid re-computing proxy values for the Buffer: this is left for a future PR.
- Make identity the default reward function.
- Remove reward_func and the rest of the related attributes.