Re-enable prefill of max model length #24446
base: main
Conversation
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Code Review
This pull request relaxes an assertion to re-enable prefilling up to the maximum model length and sampling a single token. While the intent is correct, the change as-is will likely cause an IndexError because the underlying buffer for token IDs is not large enough to accommodate the extra token. A fix is required in `vllm/v1/worker/gpu_input_batch.py` (and likely `vllm/v1/worker/tpu_input_batch.py`) to increase the buffer size.
```python
assert end_idx <= self.max_model_len + 1, (
    "Sampled token IDs exceed the max model length + 1. "
    f"Total number of tokens: {end_idx} > max_model_len + 1: "
    f"{self.max_model_len + 1}")
```
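For illustration, here is a minimal NumPy sketch of the failure mode described in this review, using hypothetical sizes and names that only mirror the comments in this thread, not actual vLLM code. In this standalone sketch the out-of-bounds write surfaces as a broadcast ValueError rather than an IndexError, but the point is the same: the extra token has nowhere to go.

```python
import numpy as np

# Hypothetical sizes; the real buffer allocation is described in the comment below.
max_num_reqs, max_model_len = 4, 8
token_ids_cpu = np.zeros((max_num_reqs, max_model_len), dtype=np.int64)

req_idx, start_idx = 0, max_model_len         # the prompt already fills the whole row
sampled_ids = np.array([42], dtype=np.int64)  # one token sampled after a full-length prefill
end_idx = start_idx + len(sampled_ids)        # == max_model_len + 1, passes the relaxed assert

# The row only has max_model_len slots, so the extra token does not fit: the target
# slice is empty and NumPy raises
# "ValueError: could not broadcast input array from shape (1,) into shape (0,)".
token_ids_cpu[req_idx, start_idx:end_idx] = sampled_ids
```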
While this change correctly relaxes the assertion to allow for prefilling up to `max_model_len` and sampling one more token, it seems to introduce a potential IndexError in the subsequent line.

The buffer `self.input_batch.token_ids_cpu` is initialized with a size of `(max_num_reqs, max_model_len)` in `vllm/v1/worker/gpu_input_batch.py`. When `end_idx` is `self.max_model_len + 1`, the slice assignment `self.input_batch.token_ids_cpu[req_idx, start_idx:end_idx] = sampled_ids` will attempt to write to an out-of-bounds index (`max_model_len`).

To fix this, the `token_ids_cpu` buffer should probably be initialized with a size of `(max_num_reqs, max_model_len + 1)`. This change would be required in `vllm/v1/worker/gpu_input_batch.py`. A similar change might be needed for `vllm/v1/worker/tpu_input_batch.py` as well.
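Continuing the hypothetical NumPy sketch above, the suggested remedy would look roughly like this; it is illustrative only, and the real change would be to the buffer allocation in `vllm/v1/worker/gpu_input_batch.py`.

```python
import numpy as np

max_num_reqs, max_model_len = 4, 8

# Reserve one extra slot per request so a full-length prefill plus the single
# sampled token fits in the row.
token_ids_cpu = np.zeros((max_num_reqs, max_model_len + 1), dtype=np.int64)

# The write that overflowed in the earlier sketch now succeeds.
token_ids_cpu[0, max_model_len:max_model_len + 1] = [42]
```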
Not sure if the change in `vllm/v1/worker/tpu_input_batch.py` is needed.
Signed-off-by: Yannick Schnider <[email protected]>
@WoosukKwon @LucasWilkinson tagging you guys here as author/reviewer of #20291
Re-enable prefill of max model length
Purpose
Prior to #20291, it was possible to do a prefill of the max model length and sample a single token. However, #20291 introduced an assertion which no longer allows this.
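A minimal sketch of the old versus new bound, for illustration only; the real check is the assertion quoted in the review thread above, and `end_idx` here stands for the prompt tokens plus the newly sampled token of a request:

```python
max_model_len = 2048
prompt_len = max_model_len   # the prefill fills the entire context window
end_idx = prompt_len + 1     # plus the single sampled token

# Check introduced by #20291 (rejects a max-length prefill plus one token):
# assert end_idx <= max_model_len
# Relaxed check in this PR (accepts it):
assert end_idx <= max_model_len + 1
```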
This PR restores the original behavior. Note that this behavior (performing a prefill at the max model length and requesting a single token) is consistent with Hugging Face Transformers behavior.
Test Plan
Can be tested with any decoder model using a prompt of length `max_model_len` and setting `max_tokens=1`.
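For example, a minimal sketch of such a test; the model name, the 2048-token context, and the prompt construction are illustrative assumptions, not part of this PR:

```python
from vllm import LLM, SamplingParams

# Any decoder model works; facebook/opt-125m with a 2048-token context is used
# here purely for illustration.
llm = LLM(model="facebook/opt-125m", max_model_len=2048)

# Build a prompt that tokenizes to exactly max_model_len tokens.
tokenizer = llm.get_tokenizer()
prompt_token_ids = tokenizer.encode("Hello world! " * 4096)[:2048]

# Request a single token on top of the full-length prefill. Before this PR this
# request tripped the assertion; with this PR it should return one token.
outputs = llm.generate(
    [{"prompt_token_ids": prompt_token_ids}],
    SamplingParams(max_tokens=1),
)
print(outputs[0].outputs[0].token_ids)
```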
Test Result
e.g. before this PR, `max_model_len=2048` and `max_tokens=1` resulted in this warning:

After this PR, the assertion error does not show up and the correct output token is returned.