[BugFix] Improve internal DP load balancing #21617
Conversation
Signed-off-by: Nick Hill <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request introduces significant improvements to the internal data-parallel load balancing. Key changes include:
- A more robust statistics collection mechanism in the `DPCoordinator` that batches stats by step to ensure consistency.
- An improved load-balancing algorithm in the `EngineCoreClient` that uses a weighted score of waiting and running requests, which is a more effective heuristic.
- A fix to make local load-balancing adjustments in the client effective between stat updates.
The changes are well-structured. However, I've found a critical issue in the handling of out-of-order statistics in the coordinator which could lead to state corruption. Please see the detailed comment.
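The weighted-score selection described in the review can be sketched as follows. This is an illustrative sketch only: the function name `pick_engine`, the stats layout, and the weight of 4 on waiting requests are assumptions for the example, not the PR's actual code.

```python
# Sketch of picking an engine by a weighted load score.
# The weight on waiting requests is hypothetical; vLLM's real
# heuristic may use different values and data structures.

def pick_engine(stats: list[tuple[int, int]]) -> int:
    """stats[i] = (num_waiting, num_running) for engine i.

    Waiting requests are weighted more heavily than running ones,
    since a growing queue hurts latency more than in-flight work.
    """
    def score(s: tuple[int, int]) -> int:
        waiting, running = s
        return waiting * 4 + running  # hypothetical weight of 4

    return min(range(len(stats)), key=lambda i: score(stats[i]))

# Engine 1 wins despite more running requests, because it has
# fewer waiting requests: 1*4 + 6 = 10 < 3*4 + 2 = 14.
print(pick_engine([(3, 2), (1, 6)]))  # -> 1
```

The point of weighting waiting requests higher is that a request sitting in a queue gains nothing from batching, whereas running requests are already making progress.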
This pull request has merge conflicts that must be resolved before it can be merged.
# Conflicts:
#	vllm/v1/engine/async_llm.py
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Noam Gat <[email protected]>
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Paul Pak <[email protected]>
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Diego-Castan <[email protected]>
Experiments showed unexpected imbalance in the internal DP load balancing, especially with large bursts of requests arriving simultaneously. There appears to be additional imbalance when using multiple API servers.
This PR adds the step count to the stats sent back from the engines to the DP coordinator, along with a small buffer that ensures all of the updated counts for a given step are aggregated before snapshotting to send to the clients. This should result in a more consistent point-in-time view of the state used for balancing.
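The step-batched aggregation idea can be sketched as follows. The class name `StepBuffer`, its method names, and the stats tuple shape are all hypothetical, assumed for illustration rather than taken from the coordinator's real API.

```python
# Sketch: buffer per-engine stats keyed by step, and only release a
# snapshot for a step once every engine has reported it. This avoids
# mixing counts from different steps in one snapshot. Names are
# hypothetical, not the DPCoordinator's actual implementation.

class StepBuffer:
    def __init__(self, num_engines: int):
        self.num_engines = num_engines
        # step -> {engine_index: (num_waiting, num_running)}
        self.pending: dict[int, dict[int, tuple[int, int]]] = {}

    def record(self, step: int, engine: int, stats: tuple[int, int]):
        """Record one engine's stats for a step.

        Returns the complete per-engine snapshot for `step` once all
        engines have reported it, else None.
        """
        per_step = self.pending.setdefault(step, {})
        per_step[engine] = stats
        if len(per_step) == self.num_engines:
            return self.pending.pop(step)
        return None

buf = StepBuffer(num_engines=2)
assert buf.record(7, 0, (1, 2)) is None   # still waiting on engine 1
snap = buf.record(7, 1, (0, 3))           # step 7 now complete
print(snap)  # -> {0: (1, 2), 1: (0, 3)}
```

A snapshot assembled this way reflects a single step across all engines, which is the consistency property the PR is after.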
Additionally, the local request counts that are updated in the clients between updates from the coordinator are now multiplied by the total number of clients (API servers), giving a better estimate of the number of new requests sent to each engine since the last stats update.
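The scaling trick works because each API server only observes its own dispatches; assuming traffic is spread roughly evenly across servers, multiplying the local delta by the server count approximates the global number of new requests. A minimal sketch, with a hypothetical function name and signature:

```python
# Sketch: estimate an engine's current load between coordinator
# updates. Each client sees only its own sends, so the local count
# is scaled by the number of clients to approximate global traffic.
# The name and signature here are illustrative, not vLLM's API.

def estimated_load(coord_count: int,
                   local_sent_since_update: int,
                   num_clients: int) -> int:
    """Last coordinator snapshot for this engine, plus this client's
    local sends scaled up by the total number of clients."""
    return coord_count + local_sent_since_update * num_clients

# With 4 API servers, 2 local sends suggest ~8 new requests
# globally on top of the snapshot count of 10.
print(estimated_load(10, 2, 4))  # -> 18
```

Without the scaling factor, each client would undercount the load it cannot see from its peers, making its local adjustments largely ineffective between stat updates.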
Testing with large-scale bursty request workloads now shows much more balanced request counts across the DP engines.