
Conversation

@heheda12345 (Collaborator) commented Aug 26, 2025

Purpose

Current beam search redoes the prefill of all tokens for every generated token when the prefix cache cannot hold all of the tokens. This PR introduces a concurency_limit parameter so that users can set it to a small enough value to achieve prefix cache hits.
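For illustration, here is a minimal sketch of the chunking idea. The helper function and the exact beam_search call signature below are assumptions for illustration, not the PR's actual code:

```python
from vllm import LLM
from vllm.sampling_params import BeamSearchParams

def beam_search_in_chunks(llm: LLM, prompts: list,
                          params: BeamSearchParams,
                          concurency_limit: int) -> list:
    """Hypothetical helper: run beam search over prompts in chunks of
    concurency_limit so each chunk's prefills can stay resident in the
    prefix cache across decoding steps."""
    outputs = []
    for start in range(0, len(prompts), concurency_limit):
        chunk = prompts[start:start + concurency_limit]
        # Each chunk finishes decoding before the next one starts, so its
        # cached prefixes are not evicted by unrelated requests.
        outputs.extend(llm.beam_search(chunk, params))
    return outputs
```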

This PR also fixes a small bug in the beam search path of benchmark_throughput.

Test Plan

python3 benchmark_throughput.py --model 'google/gemma-3-12b-it' --input-len 10000 --output-len 1000 --num-prompts 60 --n 2

Test Result

On an H100, with a concurrency limit of 10 (about 10 min):

Throughput: 0.07 requests/s, 806.72 total tokens/s, 73.35 output tokens/s
Total num prompt tokens:  599907
Total num output tokens:  60000

Without the concurrency optimization, the benchmark script did not finish after an hour of waiting.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Chen Zhang <[email protected]>
@heheda12345 requested a review from aarnphm as a code owner August 26, 2025 00:22
@heheda12345 requested a review from youkaichao August 26, 2025 00:23
@mergify bot added the frontend and performance (Performance-related issues) labels Aug 26, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a concurrency_limit parameter to the beam search functionality to optimize performance, especially when dealing with long prompts that may not fit into the prefix cache. The changes involve batching requests within the beam search loop. My review focuses on the correctness and maintainability of these changes. I've identified a consistent typo in the new API parameter concurency_limit, which should be concurrency_limit. I've also found a case of variable shadowing that could impact code readability and maintainability. Both issues are of high severity and should be addressed.

params: BeamSearchParams,
lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
use_tqdm: bool = False,
concurency_limit: Optional[int] = None,

Severity: high

There's a typo in the new parameter concurency_limit. It should be concurrency_limit. This typo appears consistently across the PR in function arguments, variable names, and documentation. Please correct it everywhere for clarity and correctness, including in tests/conftest.py and tests/samplers/test_beam_search.py.

Suggested change
concurency_limit: Optional[int] = None,
concurrency_limit: Optional[int] = None,

Comment on lines +761 to +763
for i in range(start, end):
current_beam = all_beams[i]
result = output[i]

Severity: high

The loop variable i on this line shadows the i from the outer loop at line 721 (for i in range(0, len(prompts), concurency_limit):). This is generally bad practice as it can lead to confusion and potential bugs. Please use a more descriptive and non-conflicting name for the inner loop variable, for example beam_idx.

Suggested change
for i in range(start, end):
current_beam = all_beams[i]
result = output[i]
for beam_idx in range(start, end):
current_beam = all_beams[beam_idx]
result = output[beam_idx]
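To make the shadowing concrete, here is a self-contained toy showing the renamed inner loop no longer clobbering the outer chunk index. The data and print are placeholders, not the PR's code:

```python
prompts = ["p0", "p1", "p2", "p3", "p4"]
concurency_limit = 2  # parameter name as spelled in this PR
all_beams = [f"beam-{k}" for k in range(len(prompts))]
output = [f"result-{k}" for k in range(len(prompts))]

for i in range(0, len(prompts), concurency_limit):  # outer chunk loop
    start, end = i, min(i + concurency_limit, len(prompts))
    for beam_idx in range(start, end):  # renamed: no longer shadows i
        current_beam = all_beams[beam_idx]
        result = output[beam_idx]
        # i still holds the chunk start here, as the outer loop intended
        print(i, current_beam, result)
```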

Signed-off-by: Chen Zhang <[email protected]>
@heheda12345 enabled auto-merge (squash) August 26, 2025 18:48
@github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 26, 2025
@DarkLight1337 (Member) commented

Can you update from the main branch to fix the CI failure in the entrypoints tests? Meanwhile, the samplers test failure looks related to this PR.

@heheda12345 merged commit 142ac08 into vllm-project:main Aug 27, 2025
39 checks passed
@heheda12345 deleted the fix_beam_search branch August 27, 2025 04:59
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025