[Frontend] Optimize beam search performance by limiting concurrency #23599
Conversation
Signed-off-by: Chen Zhang <[email protected]>
Code Review
This pull request introduces a `concurrency_limit` parameter to the beam search functionality to optimize performance, especially when dealing with long prompts that may not fit into the prefix cache. The changes involve batching requests within the beam search loop. My review focuses on the correctness and maintainability of these changes. I've identified a consistent typo in the new API parameter `concurency_limit`, which should be `concurrency_limit`. I've also found a case of variable shadowing that could impact code readability and maintainability. Both issues are of high severity and should be addressed.
vllm/entrypoints/llm.py
Outdated
    params: BeamSearchParams,
    lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
    use_tqdm: bool = False,
    concurency_limit: Optional[int] = None,
There's a typo in the new parameter `concurency_limit`. It should be `concurrency_limit`. This typo appears consistently across the PR in function arguments, variable names, and documentation. Please correct it everywhere for clarity and correctness, including in tests/conftest.py and tests/samplers/test_beam_search.py.
Suggested change:
-    concurency_limit: Optional[int] = None,
+    concurrency_limit: Optional[int] = None,
    for i in range(start, end):
        current_beam = all_beams[i]
        result = output[i]
The loop variable `i` on this line shadows the `i` from the outer loop at line 721 (`for i in range(0, len(prompts), concurency_limit):`). This is generally bad practice as it can lead to confusion and potential bugs. Please use a more descriptive and non-conflicting name for the inner loop variable, for example `beam_idx`.
Suggested change:
-    for i in range(start, end):
-        current_beam = all_beams[i]
-        result = output[i]
+    for beam_idx in range(start, end):
+        current_beam = all_beams[beam_idx]
+        result = output[beam_idx]
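The rename above can be illustrated with a small, self-contained sketch (hypothetical helper and names, not the actual vLLM source): the outer loop walks prompts in chunks of `concurrency_limit`, while the inner loop walks entries within the chunk, so the two indices need distinct names.

```python
# Toy sketch of the loop structure (hypothetical, not vLLM code):
# the outer index walks prompt batches, the inner index walks beams
# inside a batch, so reusing `i` for both would shadow the outer one.
def batched_beam_update(prompts, all_beams, output, concurrency_limit):
    updated = []
    for i in range(0, len(prompts), concurrency_limit):  # outer batch index
        start, end = i, min(i + concurrency_limit, len(prompts))
        for beam_idx in range(start, end):  # distinct name: no shadowing
            current_beam = all_beams[beam_idx]
            result = output[beam_idx]
            updated.append((current_beam, result))
    return updated
```

With distinct names, the outer index `i` is still valid (and meaningful) inside the inner loop body, which is exactly what the shadowed version silently broke.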
Signed-off-by: Chen Zhang <[email protected]>
Can you update from the main branch to fix the CI failure in the entrypoints tests? Meanwhile, the samplers test failure looks related to this PR.
Signed-off-by: Chen Zhang <[email protected]>
…llm-project#23599) Signed-off-by: Chen Zhang <[email protected]>
…llm-project#23599) Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: Xiao Yu <[email protected]>
Purpose
Current beam search will redo the prefill of all tokens for the generation of each token if the prefix cache cannot hold all tokens. This PR introduces a `concurency_limit` so that people can set it to a small enough value to achieve prefix cache hits. This PR also fixes a small bug in benchmark_throughput for beam search.
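A minimal sketch of the batching idea (an assumed helper for illustration, not the PR's actual code): with a limit of N, only N prompts are in flight at once, so each batch's prefill working set is small enough to stay resident in the prefix cache across decode steps.

```python
def iter_batches(num_prompts, concurency_limit=None):
    """Yield (start, end) index ranges covering all prompts.

    Without a limit, everything goes in one batch; with a limit, each
    batch holds at most `concurency_limit` prompts, keeping the working
    set small enough for prefix-cache hits between decode steps.
    """
    if concurency_limit is None:
        concurency_limit = num_prompts
    for start in range(0, num_prompts, concurency_limit):
        yield start, min(start + concurency_limit, num_prompts)
```

For example, `list(iter_batches(25, 10))` yields `[(0, 10), (10, 20), (20, 25)]`, so the beam search loop processes at most ten prompts at a time.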
Test Plan
Test Result
On H100, with concurrency 10, the run takes about 10 minutes.
Without the concurrency optimization, I waited for an hour and the benchmark script did not finish.
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.