Member

@njhill njhill commented Jul 25, 2025

Experiments showed unexpected imbalance in the internal DP load balancing, especially with large bursts of requests arriving simultaneously. There appears to be additional imbalance when using multiple API servers.

This PR adds the step count to the stats that engines send back to the DP coordinator, and adds a small buffer so that all of the updated counts for a given step are aggregated before a snapshot is taken and sent to the clients. This should result in a more consistent view of the point-in-time state used for balancing.
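
A minimal sketch of the step-aligned aggregation described above, not the coordinator code from this PR. The class name `StepAggregator`, its method names, and the `(waiting, running)` tuple layout are all hypothetical; the sketch only illustrates buffering per-engine counts until every engine has reported for the same step before a snapshot is published.

```python
from dataclasses import dataclass, field


@dataclass
class StepAggregator:
    """Hypothetical buffer that publishes a snapshot only once per complete step."""

    num_engines: int
    snapshot: list = field(default_factory=list)  # last published (waiting, running) per engine
    pending_step: int = -1                        # step currently being collected
    pending: dict = field(default_factory=dict)   # engine index -> counts for pending_step

    def __post_init__(self):
        self.snapshot = [(0, 0)] * self.num_engines

    def report(self, engine, step, waiting, running):
        """Record one engine's counts; return a new snapshot if the step is complete, else None."""
        if step < self.pending_step:
            return None  # stale report: ignored in this sketch
        if step > self.pending_step:
            # A newer step has started: fold in whatever was buffered for the old one.
            self._flush()
            self.pending_step = step
        self.pending[engine] = (waiting, running)
        if len(self.pending) == self.num_engines:
            self._flush()
            return list(self.snapshot)
        return None

    def _flush(self):
        for engine, counts in self.pending.items():
            self.snapshot[engine] = counts
        self.pending.clear()


agg = StepAggregator(num_engines=2)
first = agg.report(engine=0, step=1, waiting=3, running=1)   # only one engine so far
second = agg.report(engine=1, step=1, waiting=0, running=2)  # step 1 is now complete
```

Here `first` is `None` because only one of the two engines has reported, and `second` carries counts from both engines for the same step, so clients never see a mix of old and new values within one snapshot.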

Additionally, the local request counts that the clients update between coordinator updates are now multiplied by the total number of clients (API servers), giving a better estimate of the number of new requests in each engine since the last stats update.
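
The scaling can be sketched as follows. This is an illustration under stated assumptions rather than the client code from this PR: the function name `estimated_counts` and its arguments are hypothetical, and it assumes every client sends requests at roughly the same rate, so one client's local increments times the client count approximates the total new load per engine.

```python
def estimated_counts(coordinator_counts, local_new, num_clients):
    """Estimate per-engine request counts between coordinator updates.

    coordinator_counts: last counts received from the coordinator, one per engine
    local_new: requests this client alone has sent to each engine since that update
    num_clients: total number of clients (API servers)
    """
    return [
        base + new * num_clients
        for base, new in zip(coordinator_counts, local_new)
    ]


# This client sent 1 new request to engine 0; with 3 clients, assume about 3 arrived in total.
estimate = estimated_counts([4, 2], [1, 0], num_clients=3)
```

With these inputs the estimate for engine 0 becomes 4 + 1 * 3 = 7 while engine 1 stays at 2, so the client steers its next request toward engine 1 sooner than it would using only its own increments.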

Testing with large-scale bursty request workloads now shows much more balanced request counts across the DP engines.

@mergify mergify bot added the v1 label Jul 25, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant improvements to the internal data parallelism load balancing. Key changes include:

  • A more robust statistic collection mechanism in the DPCoordinator that batches stats by step to ensure consistency.
  • An improved load balancing algorithm in the EngineCoreClient that uses a weighted score of waiting and running requests, which is a more effective heuristic.
  • A fix to make local load balancing adjustments in the client effective between stat updates.
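
The weighted-score heuristic mentioned in the second bullet could look roughly like the sketch below. The function name `pick_engine` and the weight value are assumptions for illustration only; the review text does not state the actual weight or the exact formula.

```python
WAITING_WEIGHT = 4  # illustrative value, not taken from the PR


def pick_engine(counts):
    """Return the index of the engine with the lowest weighted load.

    counts: list of (waiting, running) request counts, one tuple per engine.
    Waiting requests are weighted more heavily than running ones, on the
    assumption that a queue signals more pressure than in-flight work.
    """
    scores = [waiting * WAITING_WEIGHT + running for waiting, running in counts]
    return min(range(len(scores)), key=scores.__getitem__)


# Scores are 4, 3, and 5, so the engine at index 1 is chosen.
chosen = pick_engine([(1, 0), (0, 3), (0, 5)])
```

Ties resolve to the lowest index because `min` returns the first minimal element.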

The changes are well-structured. However, I've found a critical issue in the handling of out-of-order statistics in the coordinator which could lead to state corruption. Please see the detailed comment.

@mergify mergify bot added the frontend label Jul 28, 2025
Signed-off-by: Nick Hill <[email protected]>
@njhill njhill marked this pull request as ready for review July 28, 2025 15:17
@njhill njhill added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jul 29, 2025
mergify bot commented Aug 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 1, 2025
# Conflicts:
#	vllm/v1/engine/async_llm.py
@mergify mergify bot removed the needs-rebase label Aug 1, 2025
@vllm-bot vllm-bot merged commit 8d524ce into vllm-project:main Aug 2, 2025
39 of 42 checks passed
@njhill njhill deleted the fix-dp-balancing branch August 4, 2025 08:55
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
noamgat pushed a commit to noamgat/vllm that referenced this pull request Aug 9, 2025
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Labels
frontend, ready (ONLY add when PR is ready to merge/full CI is needed), v1
3 participants