[MoE] More balanced expert sharding #21497
Conversation
Code Review
This pull request introduces a more balanced expert sharding strategy when the number of experts is not evenly divisible by the expert parallelism size. The previous implementation assigned all remaining experts to the last rank, which could create a performance bottleneck. The new implementation distributes the remainder experts one by one to the initial ranks, resulting in a much more even distribution. The logic is correct, clear, and improves performance by balancing the workload. I have no further comments and approve these changes.
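For illustration only (not part of the PR; the helper name per_rank_counts is made up), a minimal sketch of the property the new strategy provides: spreading the remainder over the first ranks keeps per-rank counts within one of each other.

def per_rank_counts(global_num_experts: int, ep_size: int) -> list[int]:
    # The first `remainder` ranks each take one extra expert.
    base_experts = global_num_experts // ep_size
    remainder = global_num_experts % ep_size
    return [base_experts + 1 if rank < remainder else base_experts
            for rank in range(ep_size)]

for num_experts, ep_size in [(30, 8), (64, 8), (61, 16)]:
    counts = per_rank_counts(num_experts, ep_size)
    assert sum(counts) == num_experts
    assert max(counts) - min(counts) <= 1  # counts differ by at most one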
Nice :)
+ # Distribute experts as evenly as possible to each rank.
+ base_experts = global_num_experts // ep_size
+ remainder = global_num_experts % ep_size
+ if ep_rank < remainder:
+     local_num_experts = base_experts + 1
+ else:
+     local_num_experts = base_experts
+
  # Create a tensor of size num_experts filled with -1
  expert_map = torch.full((global_num_experts, ), -1, dtype=torch.int32)
  # Create a expert map for the local experts
- if ep_rank < (ep_size - 1):
-     # Each non-last rank gets local_num_experts experts.
-     expert_map[ep_rank * local_num_experts:
-                (ep_rank + 1) * local_num_experts] = \
-         torch.arange(0, local_num_experts, dtype=torch.int32)
- else:
-     # All remaining experts are assigned to the last rank.
-     local_num_experts = (global_num_experts - ep_rank * local_num_experts)
-     expert_map[-local_num_experts:] = \
-         torch.arange(0, local_num_experts, dtype=torch.int32)
+ start_idx = ep_rank * base_experts + min(ep_rank, remainder)
+ expert_map[start_idx:start_idx + local_num_experts] = torch.arange(
+     0, local_num_experts, dtype=torch.int32)
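As a quick illustration of what the new mapping produces (a standalone sketch, not code from the PR; the wrapper function and example sizes are made up), the same arithmetic run for 8 experts on 3 ranks:

import torch

def build_expert_map(ep_rank: int, ep_size: int, global_num_experts: int):
    # Same arithmetic as the hunk above, wrapped for a standalone demo.
    base_experts = global_num_experts // ep_size
    remainder = global_num_experts % ep_size
    local_num_experts = base_experts + 1 if ep_rank < remainder else base_experts
    expert_map = torch.full((global_num_experts, ), -1, dtype=torch.int32)
    start_idx = ep_rank * base_experts + min(ep_rank, remainder)
    expert_map[start_idx:start_idx + local_num_experts] = torch.arange(
        0, local_num_experts, dtype=torch.int32)
    return local_num_experts, expert_map

for rank in range(3):
    print(rank, build_expert_map(rank, 3, 8))
# rank 0: 3 local experts, map [ 0,  1,  2, -1, -1, -1, -1, -1]
# rank 1: 3 local experts, map [-1, -1, -1,  0,  1,  2, -1, -1]
# rank 2: 2 local experts, map [-1, -1, -1, -1, -1, -1,  0,  1]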
Suggestion for more concise version:
# Distribute experts as evenly as possible to each rank.
base_experts = global_num_experts // ep_size
remainder = global_num_experts % ep_size
start_idx = lambda r: r * base_experts + min(r, remainder)
# Create a tensor of size num_experts filled with -1
expert_map = torch.full((global_num_experts, ), -1, dtype=torch.int32)
# Create a expert map for the local experts
expert_map[start_idx(ep_rank):start_idx(ep_rank + 1)] = torch.arange(
0, local_num_experts, dtype=torch.int32)
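A quick standalone check (example sizes made up) that the suggested start_idx formulation covers the same slice as the explicit if/else, since start_idx(r + 1) - start_idx(r) is exactly the per-rank expert count:

global_num_experts, ep_size = 30, 8
base_experts = global_num_experts // ep_size
remainder = global_num_experts % ep_size
start_idx = lambda r: r * base_experts + min(r, remainder)

for ep_rank in range(ep_size):
    # Explicit count, as computed in the PR.
    local_num_experts = base_experts + 1 if ep_rank < remainder else base_experts
    # The slice [start_idx(r), start_idx(r + 1)) has exactly that length.
    assert start_idx(ep_rank + 1) - start_idx(ep_rank) == local_num_experts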
Actually I'm not sure if the suggested code is simpler. 😅 We need to compute local_num_experts anyways because it's one of the return values.
@WoosukKwon my bad, I missed that we also return that value.
Currently, when num_experts is not evenly divisible by ep_size, the experts are distributed unevenly among ranks. For instance, with 30 experts and an ep_size of 8, the existing logic results in an unbalanced distribution, [3, 3, 3, 3, 3, 3, 3, 9], assigning all remainder experts to the last rank.

This PR addresses the issue by distributing the experts in a more balanced manner. After applying this fix, the experts are distributed as [4, 4, 4, 4, 4, 4, 3, 3].
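A small sketch (illustration only, not code from the PR) reproducing the two distributions quoted above for 30 experts on 8 ranks:

global_num_experts, ep_size = 30, 8

# Old logic: every rank gets floor(30 / 8) = 3 experts; the last rank takes the rest.
old = [global_num_experts // ep_size] * (ep_size - 1)
old.append(global_num_experts - sum(old))
print(old)  # [3, 3, 3, 3, 3, 3, 3, 9]

# New logic: the first 30 % 8 = 6 ranks take one extra expert each.
base_experts = global_num_experts // ep_size
remainder = global_num_experts % ep_size
new = [base_experts + 1 if r < remainder else base_experts for r in range(ep_size)]
print(new)  # [4, 4, 4, 4, 4, 4, 3, 3]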