
Conversation

@heheda12345 (Collaborator) commented Aug 26, 2025

Purpose

Current beam search redoes the prefill of all tokens for every generated token when the prefix cache cannot hold all of the tokens. This PR introduces a concurency_limit parameter so that users can set it to a small enough value to achieve prefix cache hits.
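For illustration, here is a minimal sketch of the chunking idea. The helper function and the exact beam_search call signature below are assumptions for illustration, not the PR's actual code:

```python
from vllm import LLM
from vllm.sampling_params import BeamSearchParams

def beam_search_in_chunks(llm: LLM, prompts: list,
                          params: BeamSearchParams,
                          concurency_limit: int) -> list:
    """Hypothetical helper: run beam search over prompts in chunks of
    concurency_limit so each chunk's prefills can stay resident in the
    prefix cache across decoding steps."""
    outputs = []
    for start in range(0, len(prompts), concurency_limit):
        chunk = prompts[start:start + concurency_limit]
        # Each chunk finishes decoding before the next one starts, so its
        # cached prefixes are not evicted by unrelated requests.
        outputs.extend(llm.beam_search(chunk, params))
    return outputs
```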

This PR also fixes a small bug in the beam search path of benchmark_throughput.

Test Plan

python3 benchmark_throughput.py --model 'google/gemma-3-12b-it' --input-len 10000 --output-len 1000 --num-prompts 60 --n 2

Test Result

On an H100, with a concurrency limit of 10 (about 10 min):

Throughput: 0.07 requests/s, 806.72 total tokens/s, 73.35 output tokens/s
Total num prompt tokens:  599907
Total num output tokens:  60000

Without the concurrency optimization, the benchmark script did not finish after an hour of waiting.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Chen Zhang <[email protected]>
@heheda12345 requested a review from aarnphm as a code owner August 26, 2025 00:22
@heheda12345 requested a review from youkaichao August 26, 2025 00:23
@mergify bot added the frontend and performance (Performance-related issues) labels Aug 26, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a concurrency_limit parameter to the beam search functionality to optimize performance, especially when dealing with long prompts that may not fit into the prefix cache. The changes involve batching requests within the beam search loop. My review focuses on the correctness and maintainability of these changes. I've identified a consistent typo in the new API parameter concurency_limit, which should be concurrency_limit. I've also found a case of variable shadowing that could impact code readability and maintainability. Both issues are of high severity and should be addressed.

params: BeamSearchParams,
lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
use_tqdm: bool = False,
concurency_limit: Optional[int] = None,

Severity: high

There's a typo in the new parameter concurency_limit. It should be concurrency_limit. This typo appears consistently across the PR in function arguments, variable names, and documentation. Please correct it everywhere for clarity and correctness, including in tests/conftest.py and tests/samplers/test_beam_search.py.

Suggested change
concurency_limit: Optional[int] = None,
concurrency_limit: Optional[int] = None,

Comment on lines +761 to +763
for i in range(start, end):
current_beam = all_beams[i]
result = output[i]

Severity: high

The loop variable i on this line shadows the i from the outer loop at line 721 (for i in range(0, len(prompts), concurency_limit):). This is generally bad practice as it can lead to confusion and potential bugs. Please use a more descriptive and non-conflicting name for the inner loop variable, for example beam_idx.

Suggested change
for i in range(start, end):
current_beam = all_beams[i]
result = output[i]
for beam_idx in range(start, end):
current_beam = all_beams[beam_idx]
result = output[beam_idx]
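To make the shadowing concrete, here is a self-contained toy showing the renamed inner loop no longer clobbering the outer chunk index. The data and print are placeholders, not the PR's code:

```python
prompts = ["p0", "p1", "p2", "p3", "p4"]
concurency_limit = 2  # parameter name as spelled in this PR
all_beams = [f"beam-{k}" for k in range(len(prompts))]
output = [f"result-{k}" for k in range(len(prompts))]

for i in range(0, len(prompts), concurency_limit):  # outer chunk loop
    start, end = i, min(i + concurency_limit, len(prompts))
    for beam_idx in range(start, end):  # renamed: no longer shadows i
        current_beam = all_beams[beam_idx]
        result = output[beam_idx]
        # i still holds the chunk start here, as the outer loop intended
        print(i, current_beam, result)
```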

Signed-off-by: Chen Zhang <[email protected]>
@heheda12345 enabled auto-merge (squash) August 26, 2025 18:48
@github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 26, 2025
@DarkLight1337 (Member) commented

Can you update from the main branch to fix the CI failure in the entrypoints tests? Meanwhile, the samplers test failure looks related to this PR.

@heheda12345 merged commit 142ac08 into vllm-project:main Aug 27, 2025
39 checks passed
@heheda12345 deleted the fix_beam_search branch August 27, 2025 04:59
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025