support parallel reward function #575
Codecov Report
@@            Coverage Diff             @@
##             main     #575      +/-   ##
==========================================
+ Coverage   43.58%   43.62%   +0.03%
==========================================
  Files          33       33
  Lines        4974     4997      +23
==========================================
+ Hits         2168     2180      +12
- Misses       2806     2817      +11
☔ View full report in Codecov by Sentry. |
@Jingru, hi Jingru, any updates here? I am also looking for this approach. Thanks! |
Actually, I am a bit confused: why does the reward model have to be placed separately on the last GPU? Can't the reward model run in parallel like the policy model? |
I believe this PR could solve this issue. I've already tested it myself. Maybe the abstraction of a reward "function" rather than a reward "model" is the reason why it is only invoked in ONE process/GPU. |
Another improvement may be that the hydra LoRA architecture could also share the same frozen foundation model weights between the "reward model", the "actor model", and the "critic model". |
Thanks! If I understand correctly, suppose we have two processes, 0 and 1. Each process just runs the distributed model to get its generations, and then computes the reward for its own samples as well. Why do they need to be separated, with tricks like gather, broadcast, etc.? I am confused about it :< I would much appreciate it if you have time to reply. |
For example, I stopped my debugger at this point: trlx/trlx/trainer/accelerate_base_trainer.py Line 280 in 91a0f43
But I find that self.model (the whole model instead of the partitioned one) is on a different device in each process. That's not supposed to be the case, right? As said here, ZeRO-DP would split the model into different slices. |
This is expected DeepSpeed behavior: every process has a model (wrapper), and each wrapper holds only a shard of the model weights if ZeRO-3 is enabled. Besides, inputs (prompts) and outputs are also sharded across processes (for DP). That's why |
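To make the sharding described above concrete, here is a minimal pure-Python sketch. It is an illustration only and does not reproduce DeepSpeed's actual partitioning: the `shard` helper, the toy `weights` list, and the contiguous slicing scheme are all assumptions made for this example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard(items: Sequence[T], rank: int, world_size: int) -> List[T]:
    """Return the contiguous slice of `items` that `rank` would own (hypothetical scheme)."""
    per_rank = -(-len(items) // world_size)  # ceiling division
    return list(items[rank * per_rank:(rank + 1) * per_rank])


world_size = 2
weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # stand-in for flattened model parameters
prompts = ["p0", "p1", "p2", "p3"]         # stand-in for one global batch of prompts

for rank in range(world_size):
    # Each process sees only its own weight shard and its own slice of the batch.
    print(rank, shard(weights, rank, world_size), shard(prompts, rank, world_size))
```

Every rank holds a wrapper that looks like the full model, but only its own slice of the weights and of the batch is materialized locally.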
Hi @Jingru! Thanks for the work you've done here! Could you share the script with a parallel or sharded reward model you had used? |
42a91c4 to 43ea9f1
I borrowed the reward model initialization function from DeepSpeed-Chat and defined reward_fn like this:
I believe we can just use |
Looks like computing the reward in parallel is more efficient? We no longer need to gather or broadcast, because the reward model is no longer confined to the last device. |
But how much VRAM does stage 3 need? I ask because I just found that the model is placed on each device without partitioning when stage 3 is enabled. Perhaps partitioning only happens during the forward pass. But if the model already fits on two devices, why do we need to partition it at all? The code I tested is from HF (please see the lines where I print the device and parameters):
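The difference between the two flows can be sketched without any distributed library. This is a simulation under stated assumptions: `toy_reward` is a hypothetical stand-in for a reward model, and plain lists stand in for per-rank tensors and for the gather/scatter collectives.

```python
from typing import Callable, List

RewardFn = Callable[[List[str]], List[float]]


def toy_reward(samples: List[str]) -> List[float]:
    """Hypothetical reward: sample length, standing in for a reward model."""
    return [float(len(s)) for s in samples]


def rewards_via_main_process(per_rank_samples: List[List[str]], reward_fn: RewardFn) -> List[List[float]]:
    """Gather all samples on one rank, score them there, then scatter the scores back."""
    gathered = [s for samples in per_rank_samples for s in samples]  # simulated gather
    scores = reward_fn(gathered)                                      # runs on one rank only
    scattered, start = [], 0
    for samples in per_rank_samples:                                  # simulated scatter
        scattered.append(scores[start:start + len(samples)])
        start += len(samples)
    return scattered


def rewards_in_parallel(per_rank_samples: List[List[str]], reward_fn: RewardFn) -> List[List[float]]:
    """Every rank scores only its own samples; no communication is needed."""
    return [reward_fn(samples) for samples in per_rank_samples]


per_rank = [["a", "bb"], ["ccc", "dddd"]]
# Both flows yield the same per-rank rewards; the parallel one skips gather and scatter.
assert rewards_via_main_process(per_rank, toy_reward) == rewards_in_parallel(per_rank, toy_reward)
```

The results are identical, so the gain from the parallel flow is purely in avoided communication and in not idling the other ranks while one rank scores everything.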
|
You can set |
Thanks for this, @Jingru! I carefully checked it again, and it seems this is already being shared the way --zero3_init_flag would do it, if you see the comment in the code: '
when model's it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name) otherwise the model will first be loaded normally and only partitioned at forward time which is Besides, I watch the parameter of the model after from_pretrained() |
You're right that DeepSpeed has its own VRAM manager and that the tensors in the model are just placeholders, so they are empty after model initialization. I'm not familiar with the model sharding details of DeepSpeed ZeRO-3; you may want to check the DeepSpeed implementation regarding the imbalance in model sharding. As for VRAM usage, I guess it includes optimizer states for every parameter, and they may be float32 if you're using mixed precision. |
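As a rough back-of-the-envelope check of the VRAM guess above, the sketch below assumes mixed-precision training with Adam, where each parameter costs 2 bytes (fp16 weight) + 2 bytes (fp16 gradient) + 12 bytes (fp32 master weight, momentum, and variance), and assumes ZeRO stage 3 partitions all of these evenly across GPUs. The function name and the 7B-parameter model are hypothetical; activations, buffers, and fragmentation are ignored.

```python
def zero3_model_state_bytes_per_gpu(num_params: int, world_size: int) -> float:
    """Estimate per-GPU bytes for model states under the assumptions stated above."""
    fp16_params = 2 * num_params
    fp16_grads = 2 * num_params
    fp32_optimizer_states = 12 * num_params  # fp32 master weights + Adam momentum + Adam variance
    return (fp16_params + fp16_grads + fp32_optimizer_states) / world_size


# Hypothetical example: a 7B-parameter model trained on 2 GPUs.
gib = zero3_model_state_bytes_per_gpu(7_000_000_000, 2) / 2**30
print(f"{gib:.1f} GiB of model states per GPU")
```

Under these assumptions the optimizer states account for three quarters of the total, which is consistent with the observation that VRAM usage stays high even when the weights themselves are sharded.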
4c38538 to c292e3d
Oh, if I understand correctly, accelerate cannot call prepare twice in a script, so that's why you use deepspeed.initialize. |
c292e3d to d283ee2
Currently, reward_fn is invoked only in the main process, which will hang if reward_fn is actually a parallel model (e.g., one optimized with TP, PP, or ZeRO).
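The hang can be illustrated with a small simulation. This is a sketch, not the trlX or DeepSpeed code path: threads stand in for processes, `threading.Barrier` stands in for a collective operation inside a parallel reward model, and a timeout is added so the example terminates (a real run would simply block). The `run` helper and its parameters are hypothetical.

```python
import threading
from typing import Dict, List


def run(world_size: int, ranks_calling_reward_fn: List[int], timeout: float = 0.5) -> Dict[int, str]:
    """Simulate ranks where only some of them enter a collective inside reward_fn."""
    barrier = threading.Barrier(world_size)  # stand-in for a collective (e.g. an all-gather)
    results: Dict[int, str] = {}

    def reward_fn(rank: int) -> None:
        try:
            barrier.wait(timeout=timeout)    # every rank must arrive, or nobody proceeds
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            results[rank] = "timed out"

    threads = [threading.Thread(target=reward_fn, args=(r,)) for r in ranks_calling_reward_fn]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


main_only = run(world_size=2, ranks_calling_reward_fn=[0])     # only the main process calls reward_fn
all_ranks = run(world_size=2, ranks_calling_reward_fn=[0, 1])  # every process calls reward_fn
print(main_only, all_ranks)
```

When only rank 0 enters the collective, it waits for a peer that never arrives; when every rank calls reward_fn, the collective completes.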