
Conversation


@dragondream-chen dragondream-chen commented Sep 2, 2025

Purpose

As part of the EPLB (Expert Parallelism Load Balancer) feature, this PR optimizes how expert load is updated during each forward pass. The current approach uses scatter_add_ on the topk_ids results. When using DeepEP Low-Latency or PPLX on the CUDA platform, the expert loads can instead be read directly from expert_tokens_meta.expert_num_tokens, which removes redundant computation of the expert load.
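
The change can be sketched roughly as follows. This is a minimal illustration with made-up shapes and names; only expert_tokens_meta.expert_num_tokens comes from this PR, everything else is illustrative:

import torch

# Hypothetical sizes for the sketch: num_tokens tokens routed to top_k of num_experts experts.
num_tokens, top_k, num_experts = 16, 2, 8
topk_ids = torch.randint(0, num_experts, (num_tokens, top_k))
expert_load = torch.zeros(num_experts, dtype=torch.long)

# Current approach: accumulate per-expert token counts from the routing results.
expert_load.scatter_add_(0, topk_ids.reshape(-1),
                         torch.ones_like(topk_ids.reshape(-1)))

# With DeepEP Low-Latency / PPLX on CUDA, the dispatch kernel already reports
# per-expert token counts, so the counts can be read directly instead, e.g.:
# expert_load += expert_tokens_meta.expert_num_tokens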

Test Plan

  1. Test expert load update
    Since kernels such as DeepEP Low-Latency or PPLX change the inference process, intermediate values cannot be exactly aligned with the original path. Instead, we verify that the expert load update works correctly by comparing the per-layer load imbalance of the model before and after the change.
    We add the following code in vllm/distributed/eplb/eplb_state.py for data collection.
if global_expert_load is None:
    physical_expert_load_window = self.expert_load_window.clone()
    global_physical_load_window = physical_expert_load_window.sum(dim=0)
    all_reduce(global_physical_load_window, group=ep_group)
if is_main_rank:
    global_num_experts = 96
    ep_size = ep_group.size()
    all_rank_node = []
    for ep_r in range(ep_size):
        base_experts = global_num_experts // ep_size
        remainder = global_num_experts % ep_size
        if ep_r < remainder:
            local_num_experts = base_experts + 1
        else:
            local_num_experts = base_experts
        # Mask over global experts: 1 for the experts local to rank ep_r, 0 elsewhere
        expert_map = torch.zeros(global_num_experts, device=global_physical_load_window.device, dtype=global_physical_load_window.dtype)
        start_idx = ep_r * base_experts + min(ep_r, remainder)
        expert_map[start_idx:start_idx + local_num_experts] = 1

        # [layers, phy_num] * [phy_num,] -> per-layer load on this rank
        local_load = (global_physical_load_window * expert_map).sum(1).unsqueeze(1)
        all_rank_node.append(local_load)
    all_ranks = torch.cat(all_rank_node, dim=1).float()  # [layers, ep_size]: [26, 8]
    max_ranks, _ = torch.max(all_ranks, dim=1)  # [layers]
    mean_ranks = torch.mean(all_ranks, dim=1)  # [layers]
    imbalance = torch.div(max_ranks, mean_ranks)  # [layers]
    logger.debug(f" | imbalance : {imbalance.cpu().tolist()}")
    for i in range(len(imbalance)):
        logger.debug(f" | imbalance layer {i} : {imbalance[i]} ")
  2. Test performance
    Measure the average time for a single expert-load update.

Test Result

[figure: per-layer load imbalance, before (blue) vs. after (red) the modification]
The blue curve represents the state before the modification and the red curve the state after. From the graph, the degree of imbalance is essentially the same.

The average time for a single expert-load update:
Before modification: 0.6 ms
After modification: 0.16 ms
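
The timing harness is not shown in the PR; a minimal sketch of how such a measurement could be taken on CUDA, where update_fn is a hypothetical stand-in for the expert-load update call:

import time
import torch

def time_single_update(update_fn, iters: int = 100) -> float:
    # Average wall-clock time (seconds) of one expert-load update.
    # Synchronize so queued CUDA work is not attributed to the wrong interval.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        update_fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters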


mergify bot commented Sep 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @dragondream-chen.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 2, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a performance optimization for the Expert Parallelism Load Balancer (EPLB) by avoiding scatter_add_ for expert load calculation when a more direct method is available. The core idea is sound and should improve performance. However, the implementation introduces code duplication and some brittle logic that could be improved for better long-term maintainability. My review focuses on refactoring these areas to make the code more robust and easier to maintain.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a performance optimization for EPLB by allowing expert load updates to bypass the scatter_add_ operation when using specific modular kernels. The changes are logical and well-contained. However, I've identified a critical bug in the implementation that would prevent this optimization from ever being activated. Additionally, there are a couple of high-severity maintainability issues related to code duplication and a local import that should be addressed.


github-actions bot commented Sep 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@dragondream-chen
Author

Hi @simon-mo and @khluu,

I just submitted my first PR to vLLM. I don’t have permission to unblock additional CI tests on Buildkite (only fastcheck runs by default). Could you help add me to the vLLM Buildkite org so I can trigger full CI?

Thanks for your help!


mergify bot commented Sep 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @dragondream-chen.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot removed the needs-rebase label Sep 4, 2025