Hi, I am running on two machines with 8 GPUs each, one expert per GPU, and top_k=1:

model = FMoETransformerMLP(num_expert=1, d_model=d_model, d_hidden=d_model, world_size=torch.distributed.get_world_size(), top_k=1)
Training pseudocode:

backbone_ddp = fmoe.DistributedGroupedDataParallel(model, device_ids)
....
....
backbone_ddp.allreduce_params()
optm.step()
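For reference, a minimal self-contained version of this setup might look like the sketch below. It is not a verified fix: it assumes a torchrun-style launch that sets LOCAL_RANK, and the d_model value, dummy batch, and Adam optimizer are illustrative only.

import os
import torch
import torch.distributed as dist
import fmoe
from fmoe import FMoETransformerMLP

# One process per GPU: 2 nodes x 8 GPUs = world_size 16.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun (assumption)
torch.cuda.set_device(local_rank)

d_model = 512  # illustrative width
model = FMoETransformerMLP(
    num_expert=1,                      # one expert per process
    d_model=d_model,
    d_hidden=d_model,
    world_size=dist.get_world_size(),  # 16, so 16 experts in total
    top_k=1,
).cuda()

backbone_ddp = fmoe.DistributedGroupedDataParallel(model)
optm = torch.optim.Adam(backbone_ddp.parameters(), lr=1e-4)

x = torch.randn(8, d_model, device="cuda")  # dummy batch
loss = backbone_ddp(x).sum()                # toy loss for illustration
loss.backward()
backbone_ddp.allreduce_params()             # sync gradients before the optimizer step
optm.step()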
That should give 16 × 1 = 16 experts running in parallel, right? However, I hit the following error:

File "/usr/local/python3.7.1/lib/python3.7/site-packages/fastmoe-1.1.0-py3.7-linux-x86_64.egg/fmoe/gates/naive_gate.py", line 33, in forward
    gate, k=self.top_k, dim=-1, largest=True, sorted=False
RuntimeError: invalid argument 5: k not in range for dimension at /pytorch/aten/src/THC/generic/THCTensorTopK.cu:24

What could be causing this?
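As context for the message itself: torch.topk raises "k not in range for dimension" when k is larger than the size of the dimension being reduced, so at the point of the crash the gate's last dimension (which NaiveGate sizes as num_expert * world_size) must be smaller than top_k. A quick sanity check, assuming the model exposes the usual FMoE attributes:

# Hypothetical check: confirm each process sees the full 16-expert gate.
print("world_size:", torch.distributed.get_world_size())
print("gate logits dim:", model.num_expert * model.world_size)  # must be >= top_k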