
multi-node problem #197

Open
Qianshaowei opened this issue Mar 4, 2024 · 1 comment

Comments

@Qianshaowei

Hello, I am using two machines with 8 GPUs each, one expert per process, and top_k=1:

```python
model = FMoETransformerMLP(num_expert=1, d_model=d_model, d_hidden=d_model, world_size=torch.distributed.get_world_size(), top_k=1)
```

Training pseudocode:

```python
backbone_ddp = fmoe.DistributedGroupedDataParallel(model, device_ids)
....
....
backbone_ddp.allreduce_params()
optm.step()
```

This should run 16×1 experts in parallel, right? However, training fails with:

```
File "/usr/local/python3.7.1/lib/python3.7/site-packages/fastmoe-1.1.0-py3.7-linux-x86_64.egg/fmoe/gates/naive_gate.py", line 33, in forward
    gate, k=self.top_k, dim=-1, largest=True, sorted=False
RuntimeError: invalid argument 5: k not in range for dimension at /pytorch/aten/src/THC/generic/THCTensorTopK.cu:24
```

What is causing this?
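
For reference, a runnable sketch of the setup described above might look like the following. It assumes one process per GPU launched with torchrun across the two nodes (16 processes total); `d_model`, the optimizer, and the dummy training loop are illustrative assumptions, and the exact `DistributedGroupedDataParallel` constructor arguments may differ by fastmoe version.

```python
# Minimal sketch, assuming torchrun launches one process per GPU on 2 nodes.
import torch
import torch.distributed as dist
import fmoe
from fmoe import FMoETransformerMLP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

d_model = 1024  # illustrative value
model = FMoETransformerMLP(
    num_expert=1,                          # one expert held by each process
    d_model=d_model,
    d_hidden=d_model,
    world_size=dist.get_world_size(),      # 16 processes -> 16 experts total
    top_k=1,
).cuda()

backbone_ddp = fmoe.DistributedGroupedDataParallel(model)
optm = torch.optim.SGD(backbone_ddp.parameters(), lr=1e-3)

for _ in range(10):                        # dummy training loop
    x = torch.randn(8, d_model, device="cuda")
    loss = backbone_ddp(x).sum()
    optm.zero_grad()
    loss.backward()
    backbone_ddp.allreduce_params()        # sync gradients of shared (non-expert) params
    optm.step()
```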

@laekov
Owner

laekov commented Mar 11, 2024

The cause looks like k has become 5 by the time execution reaches naive_gate.py:33, which is odd. Could you check in Python where this k becomes 5? Thanks.
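
For context, `torch.topk` raises this error whenever `k` exceeds the size of the dimension it reduces over, so the check amounts to comparing `self.top_k` against the last dimension of the gate output. A standalone illustration (the shapes are illustrative; the exact error message varies by PyTorch version):

```python
import torch

# Gate output shaped (tokens, num_expert * world_size); here 4 tokens, 16 global experts.
gate = torch.randn(4, 16)

# Valid: k=1 <= 16, matching the reported configuration.
val, idx = torch.topk(gate, k=1, dim=-1, largest=True, sorted=False)

# Invalid: k larger than the reduced dimension reproduces the error class above.
try:
    torch.topk(gate, k=17, dim=-1, largest=True, sorted=False)
except RuntimeError as e:
    print(e)  # e.g. "selected index k out of range" on recent PyTorch
```

So printing `self.top_k` and `gate.shape` immediately before the `torch.topk` call at naive_gate.py:33 should show which of the two values is wrong in the failing run.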
