Merge pull request #89 from LucasAlegre/morld

MORL/D
LucasAlegre · Jan 29, 2024 · ebf587a
2 parents 9afcb3a + cc6d0fd
commit ebf587a

Showing 23 changed files with 1,268 additions and 79 deletions.
diff --git a/.gitignore b/.gitignore
@@ -133,7 +133,7 @@ wandb/
 
 # Pycharm
 .idea/
-.DS_Store
+**/.DS_Store
 
 # Saved weights
 weights/

diff --git a/README.md b/README.md
@@ -44,18 +44,19 @@ A tutorial on MO-Gymnasium and MORL-Baselines is also available: [![Open in Cola
 
 <!-- start algos-list -->
 
-| **Name**                                                                                                                                                                 | Single/Multi-policy | ESR/SER | Observation space | Action space | Paper                                                                                                                       |
-|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------|---------|------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------|
-| [GPI-LS + GPI-PD](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/gpi_pd/gpi_pd.py)                                      | Multi               | SER     | Continuous       | Discrete / Continuous     | [Paper and Supplementary Materials](https://arxiv.org/abs/2301.07784)
-| [Envelope Q-Learning](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/envelope/envelope.py)                                      | Multi               | SER     | Continuous       | Discrete     | [Paper](https://arxiv.org/pdf/1908.08342.pdf)                                                                                        |
-| [CAPQL](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/capql/capql.py)                                      | Multi               | SER     | Continuous       | Continuous     | [Paper](https://openreview.net/pdf?id=TjEzIsyEsQ6)                                                                                        |
-| [PGMORL](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/pgmorl/pgmorl.py) <sup>[1](#f1)</sup>                                                     | Multi               | SER     | Continuous       | Continuous   | [Paper](https://people.csail.mit.edu/jiex/papers/PGMORL/paper.pdf) / [Supplementary Materials](https://people.csail.mit.edu/jiex/papers/PGMORL/supp.pdf)        |
-| [Pareto Conditioned Networks (PCN)](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/pcn/pcn.py)                                      | Multi               | SER/ESR <sup>[2](#f2)</sup>      | Continuous       | Discrete / Continuous    | [Paper](https://www.ifaamas.org/Proceedings/aamas2022/pdfs/p1110.pdf)                                                          |
-| [Pareto Q-Learning](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/pareto_q_learning/pql.py)                                    | Multi               | SER     | Discrete         | Discrete     | [Paper](https://jmlr.org/papers/volume15/vanmoffaert14a/vanmoffaert14a.pdf)                                                          |
-| [MO Q learning](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/single_policy/ser/mo_q_learning.py)                                           | Single              | SER     | Discrete         | Discrete     | [Paper](https://www.researchgate.net/publication/235698665_Scalarized_Multi-Objective_Reinforcement_Learning_Novel_Design_Techniques)                                                                                                                             |
-| [MPMOQLearning](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/multi_policy_moqlearning/mp_mo_q_learning.py)  (outer loop MOQL) | Multi               | SER     | Discrete         | Discrete     | [Paper](https://www.researchgate.net/publication/235698665_Scalarized_Multi-Objective_Reinforcement_Learning_Novel_Design_Techniques) |
-| [Optimistic Linear Support (OLS)](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/ols/ols.py)                                    | Multi               | SER     | /                | /            | Section 3.3 of the [thesis](http://roijers.info/pub/thesis.pdf)     |
-| [Expected Utility Policy Gradient (EUPG)](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/single_policy/esr/eupg.py)                          | Single              | ESR     | Discrete         | Discrete     |   [Paper](https://www.researchgate.net/publication/328718263_Multi-objective_Reinforcement_Learning_for_the_Expected_Utility_of_the_Return)                                                   |
+| **Name**                                                                                                                                                             | Single/Multi-policy | ESR/SER                     | Observation space | Action space          | Paper                                                                                                                                                    |
+|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------|-----------------------------|-------------------|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [GPI-LS + GPI-PD](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/gpi_pd/gpi_pd.py)                                              | Multi               | SER                         | Continuous        | Discrete / Continuous | [Paper and Supplementary Materials](https://arxiv.org/abs/2301.07784)                                                                                    |
+| [MORL/D](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/morld/morld.py)                                                         | Multi               | /                           | /                 | /                     | [Paper](https://arxiv.org/abs/2311.12495)                                                                                                                |
+| [Envelope Q-Learning](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/envelope/envelope.py)                                      | Multi               | SER                         | Continuous        | Discrete              | [Paper](https://arxiv.org/pdf/1908.08342.pdf)                                                                                                            |
+| [CAPQL](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/capql/capql.py)                                                          | Multi               | SER                         | Continuous        | Continuous            | [Paper](https://openreview.net/pdf?id=TjEzIsyEsQ6)                                                                                                       |
+| [PGMORL](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/pgmorl/pgmorl.py) <sup>[1](#f1)</sup>                                   | Multi               | SER                         | Continuous        | Continuous            | [Paper](https://people.csail.mit.edu/jiex/papers/PGMORL/paper.pdf) / [Supplementary Materials](https://people.csail.mit.edu/jiex/papers/PGMORL/supp.pdf) |
+| [Pareto Conditioned Networks (PCN)](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/pcn/pcn.py)                                  | Multi               | SER/ESR <sup>[2](#f2)</sup> | Continuous        | Discrete / Continuous | [Paper](https://www.ifaamas.org/Proceedings/aamas2022/pdfs/p1110.pdf)                                                                                    |
+| [Pareto Q-Learning](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/pareto_q_learning/pql.py)                                    | Multi               | SER                         | Discrete          | Discrete              | [Paper](https://jmlr.org/papers/volume15/vanmoffaert14a/vanmoffaert14a.pdf)                                                                              |
+| [MO Q learning](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/single_policy/ser/mo_q_learning.py)                                           | Single              | SER                         | Discrete          | Discrete              | [Paper](https://www.researchgate.net/publication/235698665_Scalarized_Multi-Objective_Reinforcement_Learning_Novel_Design_Techniques)                    |
+| [MPMOQLearning](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/multi_policy_moqlearning/mp_mo_q_learning.py)  (outer loop MOQL) | Multi               | SER                         | Discrete          | Discrete              | [Paper](https://www.researchgate.net/publication/235698665_Scalarized_Multi-Objective_Reinforcement_Learning_Novel_Design_Techniques)                    |
+| [Optimistic Linear Support (OLS)](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/multi_policy/ols/ols.py)                                    | Multi               | SER                         | /                 | /                     | Section 3.3 of the [thesis](http://roijers.info/pub/thesis.pdf)                                                                                          |
+| [Expected Utility Policy Gradient (EUPG)](https://github.com/LucasAlegre/morl-baselines/blob/main/morl_baselines/single_policy/esr/eupg.py)                          | Single              | ESR                         | Discrete          | Discrete              | [Paper](https://www.researchgate.net/publication/328718263_Multi-objective_Reinforcement_Learning_for_the_Expected_Utility_of_the_Return)                |
 
 :warning: Some of the algorithms have limited features.
 

diff --git a/docs/algos/multi_policy.md b/docs/algos/multi_policy.md
@@ -6,6 +6,7 @@ multi_policy/gpi_pd
 multi_policy/envelope
 multi_policy/capql
 multi_policy/pgmorl
+multi_policy/morld
 multi_policy/pcn
 multi_policy/pareto_q_learning
 multi_policy/mp_mo_q_learning

diff --git a/docs/algos/multi_policy/morld.md b/docs/algos/multi_policy/morld.md
@@ -0,0 +1,11 @@
+# MORL/D
+
+Multi-Objective Reinforcement Learning based on Decomposition. The idea of this framework is to decompose the multi-objective problem into a set of single-objective problems. The single-objective problems are then solved by a single-objective RL algorithm (or something close). There are multiple tricks which can be applied to improve the sample efficiency when compared to just sequentially solving each single-objective RL problem.
+
+See the paper [Multi-Objective Reinforcement Learning based on Decomposition](https://arxiv.org/abs/2311.12495) for more details.
+
+
+```{eval-rst}
+.. autoclass:: morl_baselines.multi_policy.morld.morld.MORLD
+    :members:
+```
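
The decomposition idea described in the new docs page can be illustrated with a minimal sketch. It assumes two objectives and a weighted-sum scalarization (the `"ws"` option used in the examples of this PR); `uniform_weights` and `weighted_sum` are hypothetical helper names for illustration, not functions from `morl_baselines`.

```python
import numpy as np


def uniform_weights(pop_size: int) -> np.ndarray:
    """Evenly spaced weight vectors on the 2-objective simplex (illustrative helper)."""
    first = np.linspace(0.0, 1.0, pop_size)
    return np.stack([first, 1.0 - first], axis=1)


def weighted_sum(reward: np.ndarray, w: np.ndarray) -> float:
    """Collapse a vector reward into a scalar for one sub-problem."""
    return float(np.dot(reward, w))


# One weight vector per single-objective sub-problem (population member).
weights = uniform_weights(6)

# The same vector reward is seen differently by each sub-problem.
vec_reward = np.array([2.0, -1.0])
scalar_rewards = [weighted_sum(vec_reward, w) for w in weights]
```

Each row of `weights` defines one single-objective problem; a single-objective learner trained on the corresponding scalar reward yields one candidate policy for the Pareto front.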
diff --git a/examples/.DS_Store b/examples/.DS_Store
diff --git a/examples/eupg_fishwood.py b/examples/eupg_fishwood.py
@@ -10,10 +10,10 @@
     env = MORecordEpisodeStatistics(mo_gym.make("fishwood-v0"), gamma=0.99)
     eval_env = mo_gym.make("fishwood-v0")
 
-    def scalarization(reward: np.ndarray):
+    def scalarization(reward: np.ndarray, w):
         return min(reward[0], reward[1] // 2)
 
-    agent = EUPG(env, scalarization=scalarization, gamma=0.99, log=True, learning_rate=0.001)
+    agent = EUPG(env, scalarization=scalarization, weights=np.ones(2), gamma=0.99, log=True, learning_rate=0.001)
     agent.train(total_timesteps=int(4e6), eval_env=eval_env, eval_freq=1000)
 
     print(eval_mo_reward_conditioned(agent, env=eval_env, scalarization=scalarization))
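
A note on the new two-argument signature: the `evaluation.py` hunk in this PR still shows a branch that calls `scalarization(vec_return)` with a single argument, presumably when no weights are passed. Assuming that reading is correct, a scalarization with a required `w` would fail on that path. A minimal sketch of a signature that tolerates both call styles (the default value is an assumption, not part of this commit):

```python
import numpy as np


def scalarization(reward: np.ndarray, w=None):
    """Fishwood utility; `w` is accepted for API compatibility but unused."""
    return min(reward[0], reward[1] // 2)


# Both call styles give the same result.
without_w = scalarization(np.array([3.0, 8.0]))
with_w = scalarization(np.array([3.0, 8.0]), np.ones(2))
```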
diff --git a/examples/morld_cheetah.py b/examples/morld_cheetah.py
@@ -0,0 +1,40 @@
+import mo_gymnasium as mo_gym
+import numpy as np
+import torch  # noqa: F401
+
+from morl_baselines.multi_policy.morld.morld import MORLD
+
+
+def main():
+    gamma = 0.99
+
+    env = mo_gym.make("mo-halfcheetah-v4")
+    eval_env = mo_gym.make("mo-halfcheetah-v4")
+
+    algo = MORLD(
+        env=env,
+        exchange_every=int(5e4),
+        pop_size=6,
+        policy_name="MOSAC",
+        scalarization_method="ws",
+        evaluation_mode="ser",
+        gamma=gamma,
+        log=False,
+        neighborhood_size=1,
+        update_passes=10,
+        shared_buffer=True,
+        sharing_mechanism=[],
+        weight_adaptation_method="PSA",
+        seed=0,
+    )
+
+    algo.train(
+        eval_env=eval_env,
+        total_timesteps=int(3e6) + 1,
+        ref_point=np.array([-100.0, -100.0]),
+        known_pareto_front=None,
+    )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/morld_hopper.py b/examples/morld_hopper.py
@@ -0,0 +1,40 @@
+import mo_gymnasium as mo_gym
+import numpy as np
+import torch  # noqa: F401
+
+from morl_baselines.multi_policy.morld.morld import MORLD
+
+
+def main():
+    gamma = 0.99
+
+    env = mo_gym.make("mo-hopper-v4")
+    eval_env = mo_gym.make("mo-hopper-v4")
+
+    algo = MORLD(
+        env=env,
+        exchange_every=int(5e4),
+        pop_size=6,
+        policy_name="MOSAC",
+        scalarization_method="ws",
+        evaluation_mode="ser",
+        gamma=gamma,
+        log=True,
+        neighborhood_size=1,
+        update_passes=10,
+        shared_buffer=True,
+        sharing_mechanism=[],
+        weight_adaptation_method=None,
+        seed=0,
+    )
+
+    algo.train(
+        eval_env=eval_env,
+        total_timesteps=int(8e6) + 1,
+        ref_point=np.array([-100.0, -100.0, -100.0]),
+        known_pareto_front=None,
+    )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/morl_baselines/common/accrued_reward_buffer.py b/morl_baselines/common/accrued_reward_buffer.py
@@ -88,21 +88,17 @@ def cleanup(self):
         """Cleanup the buffer."""
         self.size, self.ptr = 0, 0
 
-    def get_all_data(self, max_samples=None, to_tensor=False, device=None):
-        """Returns the whole buffer (with a specified maximum number of samples).
+    def get_all_data(self, to_tensor=False, device=None):
+        """Returns the whole buffer.
 
         Args:
-            max_samples: the number of samples to return, if not specified, returns the full buffer (ordered!)
             to_tensor: Whether to convert the data to tensors or not
             device: Device to use for the tensors
 
         Returns:
             Tuple of (obs, accrued_rewards, actions, rewards, next_obs, dones)
         """
-        if max_samples is not None:
-            inds = np.random.choice(self.size, min(max_samples, self.size), replace=False)
-        else:
-            inds = np.arange(self.size)
+        inds = np.arange(self.size)
         experience_tuples = (
             self.obs[inds],
             self.accrued_rewards[inds],
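
Since `get_all_data` no longer accepts `max_samples`, callers that relied on random subsampling would need to do it themselves. A minimal sketch, assuming the returned tuple contains NumPy arrays of equal length; `subsample` is a hypothetical helper, not part of the buffer class.

```python
import numpy as np


def subsample(arrays, max_samples, rng):
    """Draw the same random rows (without replacement) from every array."""
    size = len(arrays[0])
    inds = rng.choice(size, min(max_samples, size), replace=False)
    return tuple(a[inds] for a in arrays)


rng = np.random.default_rng(0)
obs = np.arange(10).reshape(5, 2)  # row i is [2i, 2i + 1]
rewards = np.arange(5)             # reward i belongs to row i
sub_obs, sub_rewards = subsample((obs, rewards), max_samples=3, rng=rng)
```

Using one index array for all fields keeps observations and rewards aligned after subsampling.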

diff --git a/morl_baselines/common/evaluation.py b/morl_baselines/common/evaluation.py
@@ -88,7 +88,7 @@ def eval_mo_reward_conditioned(
     """
     obs, _ = env.reset()
     done = False
-    vec_return, disc_vec_return = np.zeros(env.reward_space.shape[0]), np.zeros(env.reward_space.shape[0])
+    vec_return, disc_vec_return = np.zeros(env.unwrapped.reward_space.shape[0]), np.zeros(env.unwrapped.reward_space.shape[0])
     gamma = 1.0
     while not done:
         if render:
@@ -102,8 +102,9 @@ def eval_mo_reward_conditioned(
         scalarized_return = scalarization(vec_return)
         scalarized_discounted_return = scalarization(disc_vec_return)
     else:
-        scalarized_return = scalarization(w, vec_return)
-        scalarized_discounted_return = scalarization(w, disc_vec_return)
+        # watch out with the order here
+        scalarized_return = scalarization(vec_return, w)
+        scalarized_discounted_return = scalarization(disc_vec_return, w)
 
     return (
         scalarized_return,
@@ -113,19 +114,22 @@ def eval_mo_reward_conditioned(
     )
 
 
-def policy_evaluation_mo(agent, env, w: np.ndarray, rep: int = 5) -> Tuple[float, float, np.ndarray, np.ndarray]:
+def policy_evaluation_mo(
+    agent, env, w: np.ndarray, scalarization=np.dot, rep: int = 5
+) -> Tuple[float, float, np.ndarray, np.ndarray]:
     """Evaluates the value of a policy by running the policy for multiple episodes. Returns the average returns.
 
     Args:
         agent: Agent
         env: MO-Gymnasium environment
         w (np.ndarray): Weight vector
+        scalarization: scalarization function, taking reward and weight as parameters
         rep (int, optional): Number of episodes for averaging. Defaults to 5.
 
     Returns:
         (float, float, np.ndarray, np.ndarray): Avg scalarized return, Avg scalarized discounted return, Avg vectorized return, Avg vectorized discounted return
     """
-    evals = [eval_mo(agent, env, w) for _ in range(rep)]
+    evals = [eval_mo(agent=agent, env=env, w=w, scalarization=scalarization) for _ in range(rep)]
     avg_scalarized_return = np.mean([eval[0] for eval in evals])
     avg_scalarized_discounted_return = np.mean([eval[1] for eval in evals])
     avg_vec_return = np.mean([eval[2] for eval in evals], axis=0)
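
The argument order `(reward, w)` makes no difference for the default `np.dot` on 1-D vectors, but it does matter for custom scalarizations. The sketch below mimics only the averaging step of `policy_evaluation_mo` over precomputed returns; `average_scalarized` and `weighted_min` are hypothetical names for illustration, not library functions.

```python
import numpy as np


def weighted_min(reward: np.ndarray, w: np.ndarray) -> float:
    """Example non-linear scalarization: the worst weighted objective."""
    return float(np.min(reward * w))


def average_scalarized(returns, w, scalarization=np.dot) -> float:
    """Average the scalarized value of several vector returns."""
    return float(np.mean([scalarization(r, w) for r in returns]))


returns = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
w = np.array([0.5, 0.5])

linear_avg = average_scalarized(returns, w)                # default np.dot
min_avg = average_scalarized(returns, w, weighted_min)     # custom scalarization
```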

diff --git a/morl_baselines/common/experiments.py b/morl_baselines/common/experiments.py
@@ -8,6 +8,7 @@
     GPILSContinuousAction,
     GPIPDContinuousAction,
 )
+from morl_baselines.multi_policy.morld.morld import MORLD
 from morl_baselines.multi_policy.multi_policy_moqlearning.mp_mo_q_learning import (
     MPMOQLearning,
 )
@@ -29,6 +30,7 @@
     "pql": PQL,
     "ols": MPMOQLearning,
     "gpi-ls": MPMOQLearning,
+    "morld": MORLD,
 }
 
 ENVS_WITH_KNOWN_PARETO_FRONT = [