[Perf] EPLB optimize export_load_view update #24091
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a performance optimization for the Expert Parallelism Load Balancer (EPLB) by avoiding `scatter_add_` for expert load calculation when a more direct method is available. The core idea is sound and should improve performance. However, the implementation introduces code duplication and some brittle logic that could be improved for better long-term maintainability. My review focuses on refactoring these areas to make the code more robust and easier to maintain.
Code Review
This pull request introduces a performance optimization for EPLB by allowing expert load updates to bypass the `scatter_add_` operation when using specific modular kernels. The changes are logical and well-contained. However, I've identified a critical bug in the implementation that would prevent this optimization from ever being activated. Additionally, there are a couple of high-severity maintainability issues related to code duplication and a local import that should be addressed.
Signed-off-by: chenmenglong <[email protected]>
Purpose

In the EPLB (Experts Load Balance) feature, this PR optimizes the method used to update the expert load during each forward pass. The current approach uses the `scatter_add_` method based on the `topk_ids` results. When using DeepEP Low-Latency or PPLX on the CUDA platform, the expert loads can be obtained directly from `expert_tokens_meta.expert_num_tokens`, which removes redundant computation of the expert load.

Test Plan

Since using a kernel such as DeepEP Low-Latency or PPLX changes the inference process, the intermediate values cannot be fully numerically aligned. We therefore verify that the expert load update works correctly by comparing the load imbalance of a one-layer model. We added code in `vllm/distributed/eplb/eplb_state.py` for data collection, and measured the average time for a single update of the expert load.
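The two ways of deriving the expert load described above can be sketched in plain Python (an illustrative sketch, not the vLLM implementation; the function names are hypothetical):

```python
# Hypothetical sketch contrasting the two expert-load accumulation paths.

def load_via_scatter_add(topk_ids, num_experts):
    """Baseline path: rebuild per-expert token counts from topk_ids,
    analogous to scatter_add_ on a zero-initialized load tensor."""
    load = [0] * num_experts
    for token_experts in topk_ids:   # one row of chosen expert ids per token
        for e in token_experts:
            load[e] += 1
    return load

def load_via_expert_num_tokens(expert_num_tokens):
    """Optimized path: kernels such as DeepEP Low-Latency or PPLX already
    report per-expert token counts (expert_tokens_meta.expert_num_tokens),
    so the load can be read directly instead of re-aggregating topk_ids."""
    return list(expert_num_tokens)

# 3 tokens, top-2 routing, 3 experts: both paths should agree.
topk_ids = [[0, 2], [1, 2], [0, 1]]
print(load_via_scatter_add(topk_ids, 3))      # → [2, 2, 2]
print(load_via_expert_num_tokens([2, 2, 2]))  # → [2, 2, 2]
```

The optimized path skips the per-token aggregation entirely, which is where the saved time comes from.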
Test Result
The average time for a single update of the expert load:
Before modification: 0.6ms
After modification: 0.16ms
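A per-update average like the figures above could be collected with a simple wall-clock harness (a hypothetical sketch; for real GPU kernels, CUDA events and synchronization would be needed for accurate timing):

```python
import time

def average_update_time_ms(update_fn, iters=1000):
    """Run update_fn `iters` times and return the mean wall-clock time in ms."""
    start = time.perf_counter()
    for _ in range(iters):
        update_fn()
    return (time.perf_counter() - start) / iters * 1e3

# Trivial stand-in for the expert-load update being measured.
load = [0] * 8
def fake_update():
    for e in range(8):
        load[e] += 1

avg_ms = average_update_time_ms(fake_update)
print(f"average update time: {avg_ms:.4f} ms")
```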