[Perf] EPLB optimize export_load_view update #24091
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a performance optimization for the Expert Parallelism Load Balancer (EPLB) by avoiding `scatter_add_` for expert load calculation when a more direct method is available. The core idea is sound and should improve performance. However, the implementation introduces code duplication and some brittle logic that could be improved for better long-term maintainability. My review focuses on refactoring these areas to make the code more robust and easier to maintain.
Code Review
This pull request introduces a performance optimization for EPLB by allowing expert load updates to bypass the `scatter_add_` operation when using specific modular kernels. The changes are logical and well-contained. However, I've identified a critical bug in the implementation that would prevent this optimization from ever being activated. Additionally, there are a couple of high-severity maintainability issues related to code duplication and a local import that should be addressed.
Signed-off-by: chenmenglong <[email protected]>
Purpose

In the EPLB (Experts Load Balance) feature, this PR optimizes the method used to update the expert load during each forward pass. The current approach uses the `scatter_add_` method based on the `topk_ids` results. When using DeepEP Low-Latency or PPLX on the CUDA platform, the expert loads can be obtained directly from `expert_tokens_meta.expert_num_tokens`, which removes redundant computation of the expert load.

Test Plan

Since using a kernel such as DeepEP Low-Latency or PPLX changes the inference process, the intermediate values cannot be fully numerically aligned. We therefore verify that the expert load update works correctly by comparing the load imbalance of a one-layer model. We added code in `vllm/distributed/eplb/eplb_state.py` for data collection, and measured the average time for a single update of the expert load.
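The two ways of deriving the expert load described above can be sketched in plain Python (an illustrative sketch, not the vLLM implementation; the function names are hypothetical):

```python
# Hypothetical sketch contrasting the two expert-load accumulation paths.

def load_via_scatter_add(topk_ids, num_experts):
    """Baseline path: rebuild per-expert token counts from topk_ids,
    analogous to scatter_add_ on a zero-initialized load tensor."""
    load = [0] * num_experts
    for token_experts in topk_ids:   # one row of chosen expert ids per token
        for e in token_experts:
            load[e] += 1
    return load

def load_via_expert_num_tokens(expert_num_tokens):
    """Optimized path: kernels such as DeepEP Low-Latency or PPLX already
    report per-expert token counts (expert_tokens_meta.expert_num_tokens),
    so the load can be read directly instead of re-aggregating topk_ids."""
    return list(expert_num_tokens)

# 3 tokens, top-2 routing, 3 experts: both paths should agree.
topk_ids = [[0, 2], [1, 2], [0, 1]]
print(load_via_scatter_add(topk_ids, 3))      # → [2, 2, 2]
print(load_via_expert_num_tokens([2, 2, 2]))  # → [2, 2, 2]
```

The optimized path skips the per-token aggregation entirely, which is where the saved time comes from.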
Test Result
The average time for a single update of the expert load:
Before modification: 0.6ms
After modification: 0.16ms
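A per-update average like the figures above could be collected with a simple wall-clock harness (a hypothetical sketch; for real GPU kernels, CUDA events and synchronization would be needed for accurate timing):

```python
import time

def average_update_time_ms(update_fn, iters=1000):
    """Run update_fn `iters` times and return the mean wall-clock time in ms."""
    start = time.perf_counter()
    for _ in range(iters):
        update_fn()
    return (time.perf_counter() - start) / iters * 1e3

# Trivial stand-in for the expert-load update being measured.
load = [0] * 8
def fake_update():
    for e in range(8):
        load[e] += 1

avg_ms = average_update_time_ms(fake_update)
print(f"average update time: {avg_ms:.4f} ms")
```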