[BugFix] Improve internal DP load balancing #21617
Conversation
Signed-off-by: Nick Hill <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request introduces significant improvements to the internal data-parallel load balancing. Key changes include:
- A more robust statistics collection mechanism in the `DPCoordinator` that batches stats by step to ensure consistency.
- An improved load-balancing algorithm in the `EngineCoreClient` that uses a weighted score of waiting and running requests, which is a more effective heuristic.
- A fix to make local load-balancing adjustments in the client effective between stat updates.
The changes are well-structured. However, I've found a critical issue in the handling of out-of-order statistics in the coordinator which could lead to state corruption. Please see the detailed comment.
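The weighted-score selection described in the review can be sketched as follows. This is an illustrative sketch only: the function name `pick_engine`, the stats layout, and the weight of 4 on waiting requests are assumptions for the example, not the PR's actual code.

```python
# Sketch of picking an engine by a weighted load score.
# The weight on waiting requests is hypothetical; vLLM's real
# heuristic may use different values and data structures.

def pick_engine(stats: list[tuple[int, int]]) -> int:
    """stats[i] = (num_waiting, num_running) for engine i.

    Waiting requests are weighted more heavily than running ones,
    since a growing queue hurts latency more than in-flight work.
    """
    def score(s: tuple[int, int]) -> int:
        waiting, running = s
        return waiting * 4 + running  # hypothetical weight of 4

    return min(range(len(stats)), key=lambda i: score(stats[i]))

# Engine 1 wins despite more running requests, because it has
# fewer waiting requests: 1*4 + 6 = 10 < 3*4 + 2 = 14.
print(pick_engine([(3, 2), (1, 6)]))  # -> 1
```

The point of weighting waiting requests higher is that a request sitting in a queue gains nothing from batching, whereas running requests are already making progress.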
This pull request has merge conflicts that must be resolved before it can be merged.
# Conflicts:
#	vllm/v1/engine/async_llm.py
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Noam Gat <[email protected]>
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Paul Pak <[email protected]>
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Diego-Castan <[email protected]>
Experiments showed unexpected imbalance in the internal DP load balancing, especially with large bursts of requests arriving simultaneously. There appears to be additional imbalance when using multiple API servers.
This PR adds the step count to the stats sent back from the engines to the DP coordinator, along with a small buffer that ensures all of the updated counts for a given step are aggregated before snapshotting to send to the clients. This should result in a more consistent point-in-time view of the state used for balancing.
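The step-batched aggregation idea can be sketched as follows. The class name `StepBuffer`, its method names, and the stats tuple shape are all hypothetical, assumed for illustration rather than taken from the coordinator's real API.

```python
# Sketch: buffer per-engine stats keyed by step, and only release a
# snapshot for a step once every engine has reported it. This avoids
# mixing counts from different steps in one snapshot. Names are
# hypothetical, not the DPCoordinator's actual implementation.

class StepBuffer:
    def __init__(self, num_engines: int):
        self.num_engines = num_engines
        # step -> {engine_index: (num_waiting, num_running)}
        self.pending: dict[int, dict[int, tuple[int, int]]] = {}

    def record(self, step: int, engine: int, stats: tuple[int, int]):
        """Record one engine's stats for a step.

        Returns the complete per-engine snapshot for `step` once all
        engines have reported it, else None.
        """
        per_step = self.pending.setdefault(step, {})
        per_step[engine] = stats
        if len(per_step) == self.num_engines:
            return self.pending.pop(step)
        return None

buf = StepBuffer(num_engines=2)
assert buf.record(7, 0, (1, 2)) is None   # still waiting on engine 1
snap = buf.record(7, 1, (0, 3))           # step 7 now complete
print(snap)  # -> {0: (1, 2), 1: (0, 3)}
```

A snapshot assembled this way reflects a single step across all engines, which is the consistency property the PR is after.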
Additionally, the local request counts that are updated in the clients between updates from the coordinator are now multiplied by the total number of clients (API servers), giving a better estimate of the number of new requests sent to each engine since the last stats update.
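The scaling trick works because each API server only observes its own dispatches; assuming traffic is spread roughly evenly across servers, multiplying the local delta by the server count approximates the global number of new requests. A minimal sketch, with a hypothetical function name and signature:

```python
# Sketch: estimate an engine's current load between coordinator
# updates. Each client sees only its own sends, so the local count
# is scaled by the number of clients to approximate global traffic.
# The name and signature here are illustrative, not vLLM's API.

def estimated_load(coord_count: int,
                   local_sent_since_update: int,
                   num_clients: int) -> int:
    """Last coordinator snapshot for this engine, plus this client's
    local sends scaled up by the total number of clients."""
    return coord_count + local_sent_since_update * num_clients

# With 4 API servers, 2 local sends suggest ~8 new requests
# globally on top of the snapshot count of 10.
print(estimated_load(10, 2, 4))  # -> 18
```

Without the scaling factor, each client would undercount the load it cannot see from its peers, making its local adjustments largely ineffective between stat updates.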
Testing with large-scale bursty request workloads now shows much more balanced request counts across the DP engines.