Merge pull request #89 from LucasAlegre/morld
MORL/D
ffelten authored Jan 29, 2024
2 parents 9afcb3a + cc6d0fd commit ebf587a
Showing 23 changed files with 1,268 additions and 79 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -133,7 +133,7 @@ wandb/

# Pycharm
.idea/
.DS_Store
**/.DS_Store

# Saved weights
weights/
25 changes: 13 additions & 12 deletions README.md
@@ -44,18 +44,19 @@ A tutorial on MO-Gymnasium and MORL-Baselines is also available: [![Open in Cola

<!-- start algos-list -->

| **Name** | Single/Multi-policy | ESR/SER | Observation space | Action space | Paper |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------|-----------------------------|-------------------|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| [GPI-LS + GPI-PD](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/gpi_pd/gpi_pd.py) | Multi | SER | Continuous | Discrete / Continuous | [Paper and Supplementary Materials](https://arxiv.org/abs/2301.07784) |
| [MORL/D](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/morld/morld.py) | Multi | / | / | / | [Paper](https://arxiv.org/abs/2311.12495) |
| [Envelope Q-Learning](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/envelope/envelope.py) | Multi | SER | Continuous | Discrete | [Paper](https://arxiv.org/pdf/1908.08342.pdf) |
| [CAPQL](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/capql/capql.py) | Multi | SER | Continuous | Continuous | [Paper](https://openreview.net/pdf?id=TjEzIsyEsQ6) |
| [PGMORL](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/pgmorl/pgmorl.py) <sup>[1](#f1)</sup> | Multi | SER | Continuous | Continuous | [Paper](https://people.csail.mit.edu/jiex/papers/PGMORL/paper.pdf) / [Supplementary Materials](https://people.csail.mit.edu/jiex/papers/PGMORL/supp.pdf) |
| [Pareto Conditioned Networks (PCN)](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/pcn/pcn.py) | Multi | SER/ESR <sup>[2](#f2)</sup> | Continuous | Discrete / Continuous | [Paper](https://www.ifaamas.org/Proceedings/aamas2022/pdfs/p1110.pdf) |
| [Pareto Q-Learning](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/pareto_q_learning/pql.py) | Multi | SER | Discrete | Discrete | [Paper](https://jmlr.org/papers/volume15/vanmoffaert14a/vanmoffaert14a.pdf) |
| [MO Q learning](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/single_policy/ser/mo_q_learning.py) | Single | SER | Discrete | Discrete | [Paper](https://www.researchgate.net/publication/235698665_Scalarized_Multi-Objective_Reinforcement_Learning_Novel_Design_Techniques) |
| [MPMOQLearning](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/multi_policy_moqlearning/mp_mo_q_learning.py) (outer loop MOQL) | Multi | SER | Discrete | Discrete | [Paper](https://www.researchgate.net/publication/235698665_Scalarized_Multi-Objective_Reinforcement_Learning_Novel_Design_Techniques) |
| [Optimistic Linear Support (OLS)](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/ols/ols.py) | Multi | SER | / | / | Section 3.3 of the [thesis](http://roijers.info/pub/thesis.pdf) |
| [Expected Utility Policy Gradient (EUPG)](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/single_policy/esr/eupg.py) | Single | ESR | Discrete | Discrete | [Paper](https://www.researchgate.net/publication/328718263_Multi-objective_Reinforcement_Learning_for_the_Expected_Utility_of_the_Return) |

:warning: Some of the algorithms have limited features.

1 change: 1 addition & 0 deletions docs/algos/multi_policy.md
@@ -6,6 +6,7 @@ multi_policy/gpi_pd
multi_policy/envelope
multi_policy/capql
multi_policy/pgmorl
multi_policy/morld
multi_policy/pcn
multi_policy/pareto_q_learning
multi_policy/mp_mo_q_learning
11 changes: 11 additions & 0 deletions docs/algos/multi_policy/morld.md
@@ -0,0 +1,11 @@
# MORL/D

Multi-Objective Reinforcement Learning based on Decomposition (MORL/D). This framework decomposes the multi-objective problem into a set of single-objective subproblems, each defined by a weight vector and a scalarization function. Each subproblem is then solved by a (possibly lightly adapted) single-objective RL algorithm. Various cooperation mechanisms, such as sharing a replay buffer across the population, exchanging information between neighboring policies, and adapting the weight vectors during training, can improve sample efficiency compared to solving each subproblem sequentially and independently.

See the paper [Multi-Objective Reinforcement Learning based on Decomposition](https://arxiv.org/abs/2311.12495) for more details.
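
To make the decomposition idea concrete, here is a small illustrative sketch (these helpers are hypothetical and not part of MORL-Baselines): each weight vector defines one single-objective subproblem by scalarizing the vector reward.

```python
import numpy as np


def uniform_weights(n_subproblems: int) -> np.ndarray:
    """Evenly spread weight vectors on the 2-objective simplex."""
    alphas = np.linspace(0.0, 1.0, n_subproblems)
    return np.stack([alphas, 1.0 - alphas], axis=1)


def weighted_sum(reward: np.ndarray, w: np.ndarray) -> float:
    """Weighted-sum scalarization: turns a vector reward into a scalar one."""
    return float(np.dot(w, reward))


weights = uniform_weights(5)        # one weight vector per subproblem
vec_reward = np.array([1.0, -0.5])  # example 2-objective reward
subproblem_rewards = [weighted_sum(vec_reward, w) for w in weights]
```

For how the number of subproblems and the scalarization are configured in practice, see the `pop_size` and `scalarization_method` arguments in the examples added by this commit.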


```{eval-rst}
.. autoclass:: morl_baselines.multi_policy.morld.morld.MORLD
    :members:
```
Binary file removed examples/.DS_Store
4 changes: 2 additions & 2 deletions examples/eupg_fishwood.py
@@ -10,10 +10,10 @@
env = MORecordEpisodeStatistics(mo_gym.make("fishwood-v0"), gamma=0.99)
eval_env = mo_gym.make("fishwood-v0")

def scalarization(reward: np.ndarray):
def scalarization(reward: np.ndarray, w):
    return min(reward[0], reward[1] // 2)

agent = EUPG(env, scalarization=scalarization, gamma=0.99, log=True, learning_rate=0.001)
agent = EUPG(env, scalarization=scalarization, weights=np.ones(2), gamma=0.99, log=True, learning_rate=0.001)
agent.train(total_timesteps=int(4e6), eval_env=eval_env, eval_freq=1000)

print(eval_mo_reward_conditioned(agent, env=eval_env, scalarization=scalarization))
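
Note the updated convention in this example: scalarization callables now receive the vector reward first and the weight vector second (the fishwood utility above simply ignores `w`). As an illustration only, a hypothetical weighted-sum utility with the same signature could look like this:

```python
import numpy as np


def weighted_sum(reward: np.ndarray, w: np.ndarray) -> float:
    # Same (reward, w) argument order as the fishwood utility; here the weights are used.
    return float(np.dot(w, reward))
```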
40 changes: 40 additions & 0 deletions examples/morld_cheetah.py
@@ -0,0 +1,40 @@
import mo_gymnasium as mo_gym
import numpy as np
import torch # noqa: F401

from morl_baselines.multi_policy.morld.morld import MORLD


def main():
    gamma = 0.99

    env = mo_gym.make("mo-halfcheetah-v4")
    eval_env = mo_gym.make("mo-halfcheetah-v4")

    algo = MORLD(
        env=env,
        exchange_every=int(5e4),  # timesteps between cooperation/exchange phases
        pop_size=6,  # number of policies (subproblems) in the population
        policy_name="MOSAC",  # underlying learner used for each subproblem
        scalarization_method="ws",  # weighted sum
        evaluation_mode="ser",
        gamma=gamma,
        log=False,
        neighborhood_size=1,  # neighboring subproblems considered for cooperation
        update_passes=10,
        shared_buffer=True,  # the population shares a single replay buffer
        sharing_mechanism=[],
        weight_adaptation_method="PSA",  # adapt the weight vectors during training
        seed=0,
    )

    algo.train(
        eval_env=eval_env,
        total_timesteps=int(3e6) + 1,
        ref_point=np.array([-100.0, -100.0]),  # reference point for hypervolume computation
        known_pareto_front=None,
    )


if __name__ == "__main__":
    main()
40 changes: 40 additions & 0 deletions examples/morld_hopper.py
@@ -0,0 +1,40 @@
import mo_gymnasium as mo_gym
import numpy as np
import torch # noqa: F401

from morl_baselines.multi_policy.morld.morld import MORLD


def main():
    gamma = 0.99

    env = mo_gym.make("mo-hopper-v4")
    eval_env = mo_gym.make("mo-hopper-v4")

    algo = MORLD(
        env=env,
        exchange_every=int(5e4),
        pop_size=6,
        policy_name="MOSAC",
        scalarization_method="ws",
        evaluation_mode="ser",
        gamma=gamma,
        log=True,
        neighborhood_size=1,
        update_passes=10,
        shared_buffer=True,
        sharing_mechanism=[],
        weight_adaptation_method=None,
        seed=0,
    )

    algo.train(
        eval_env=eval_env,
        total_timesteps=int(8e6) + 1,
        ref_point=np.array([-100.0, -100.0, -100.0]),
        known_pareto_front=None,
    )


if __name__ == "__main__":
    main()
10 changes: 3 additions & 7 deletions morl_baselines/common/accrued_reward_buffer.py
@@ -88,21 +88,17 @@ def cleanup(self):
"""Cleanup the buffer."""
self.size, self.ptr = 0, 0

def get_all_data(self, max_samples=None, to_tensor=False, device=None):
"""Returns the whole buffer (with a specified maximum number of samples).
def get_all_data(self, to_tensor=False, device=None):
"""Returns the whole buffer.
Args:
max_samples: the number of samples to return, if not specified, returns the full buffer (ordered!)
to_tensor: Whether to convert the data to tensors or not
device: Device to use for the tensors
Returns:
Tuple of (obs, accrued_rewards, actions, rewards, next_obs, dones)
"""
if max_samples is not None:
inds = np.random.choice(self.size, min(max_samples, self.size), replace=False)
else:
inds = np.arange(self.size)
inds = np.arange(self.size)
experience_tuples = (
self.obs[inds],
self.accrued_rewards[inds],
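As a side note, here is a self-contained toy sketch of the behavioural change (the arrays below are stand-ins for the buffer's internal storage, not its real attributes): `get_all_data` now always returns every stored transition in insertion order instead of an optional random subsample.

```python
import numpy as np

size = 4                            # number of valid entries currently in the buffer
obs = np.arange(20).reshape(10, 2)  # toy storage array with capacity 10

inds = np.arange(size)              # new behaviour: ordered and complete
all_obs = obs[inds]                 # the first `size` rows, in insertion order

# removed behaviour (for comparison): a random subsample of at most max_samples rows
# inds = np.random.choice(size, min(max_samples, size), replace=False)
```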
14 changes: 9 additions & 5 deletions morl_baselines/common/evaluation.py
@@ -88,7 +88,7 @@ def eval_mo_reward_conditioned(
"""
obs, _ = env.reset()
done = False
vec_return, disc_vec_return = np.zeros(env.reward_space.shape[0]), np.zeros(env.reward_space.shape[0])
vec_return, disc_vec_return = np.zeros(env.unwrapped.reward_space.shape[0]), np.zeros(env.unwrapped.reward_space.shape[0])
gamma = 1.0
while not done:
if render:
@@ -102,8 +102,9 @@
        scalarized_return = scalarization(vec_return)
        scalarized_discounted_return = scalarization(disc_vec_return)
    else:
        scalarized_return = scalarization(w, vec_return)
        scalarized_discounted_return = scalarization(w, disc_vec_return)
        # watch out with the order here
        scalarized_return = scalarization(vec_return, w)
        scalarized_discounted_return = scalarization(disc_vec_return, w)

    return (
        scalarized_return,
@@ -113,19 +114,22 @@
    )


def policy_evaluation_mo(agent, env, w: np.ndarray, rep: int = 5) -> Tuple[float, float, np.ndarray, np.ndarray]:
def policy_evaluation_mo(
    agent, env, w: np.ndarray, scalarization=np.dot, rep: int = 5
) -> Tuple[float, float, np.ndarray, np.ndarray]:
    """Evaluates the value of a policy by running the policy for multiple episodes. Returns the average returns.
    Args:
        agent: Agent
        env: MO-Gymnasium environment
        w (np.ndarray): Weight vector
        scalarization: scalarization function, taking reward and weight as parameters
        rep (int, optional): Number of episodes for averaging. Defaults to 5.
    Returns:
        (float, float, np.ndarray, np.ndarray): Avg scalarized return, Avg scalarized discounted return, Avg vectorized return, Avg vectorized discounted return
    """
    evals = [eval_mo(agent, env, w) for _ in range(rep)]
    evals = [eval_mo(agent=agent, env=env, w=w, scalarization=scalarization) for _ in range(rep)]
    avg_scalarized_return = np.mean([eval[0] for eval in evals])
    avg_scalarized_discounted_return = np.mean([eval[1] for eval in evals])
    avg_vec_return = np.mean([eval[2] for eval in evals], axis=0)
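To illustrate how `policy_evaluation_mo` aggregates episodes and how the new scalarization argument order is used, here is a hedged, self-contained sketch; `eval_once` is a toy stand-in for `eval_mo` and is not part of the repository:

```python
import numpy as np

rng = np.random.default_rng(0)


def eval_once(w: np.ndarray, scalarization=np.dot):
    """Toy stand-in for eval_mo: returns (scalarized, scalarized discounted, vec, disc vec) returns."""
    vec_return = rng.normal(size=w.shape)  # pretend episodic vector return
    disc_vec_return = 0.99 * vec_return    # pretend discounted vector return
    return (
        scalarization(vec_return, w),      # reward first, weights second
        scalarization(disc_vec_return, w),
        vec_return,
        disc_vec_return,
    )


w = np.array([0.5, 0.5])
evals = [eval_once(w) for _ in range(5)]                 # rep = 5 episodes
avg_scalarized_return = np.mean([e[0] for e in evals])
avg_vec_return = np.mean([e[2] for e in evals], axis=0)  # objective-wise average
```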
2 changes: 2 additions & 0 deletions morl_baselines/common/experiments.py
@@ -8,6 +8,7 @@
    GPILSContinuousAction,
    GPIPDContinuousAction,
)
from morl_baselines.multi_policy.morld.morld import MORLD
from morl_baselines.multi_policy.multi_policy_moqlearning.mp_mo_q_learning import (
    MPMOQLearning,
)
@@ -29,6 +30,7 @@
"pql": PQL,
"ols": MPMOQLearning,
"gpi-ls": MPMOQLearning,
"morld": MORLD,
}

ENVS_WITH_KNOWN_PARETO_FRONT = [
