Re-enable prefill of max model length #24446
base: main
Conversation
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Code Review
This pull request relaxes an assertion to re-enable prefilling up to the maximum model length and sampling a single token. While the intent is correct, the change as-is will likely cause an IndexError because the underlying buffer for token IDs is not large enough to accommodate the extra token. A fix is required in `vllm/v1/worker/gpu_input_batch.py` (and likely `vllm/v1/worker/tpu_input_batch.py`) to increase the buffer size.
```python
assert end_idx <= self.max_model_len + 1, (
    "Sampled token IDs exceed the max model length + 1. "
    f"Total number of tokens: {end_idx} > max_model_len + 1: "
    f"{self.max_model_len + 1}")
```
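For illustration, here is a minimal NumPy sketch of the failure mode described in this review, using hypothetical sizes and names that only mirror the comments in this thread, not actual vLLM code. In this standalone sketch the out-of-bounds write surfaces as a broadcast ValueError rather than an IndexError, but the point is the same: the extra token has nowhere to go.

```python
import numpy as np

# Hypothetical sizes; the real buffer allocation is described in the comment below.
max_num_reqs, max_model_len = 4, 8
token_ids_cpu = np.zeros((max_num_reqs, max_model_len), dtype=np.int64)

req_idx, start_idx = 0, max_model_len         # the prompt already fills the whole row
sampled_ids = np.array([42], dtype=np.int64)  # one token sampled after a full-length prefill
end_idx = start_idx + len(sampled_ids)        # == max_model_len + 1, passes the relaxed assert

# The row only has max_model_len slots, so the extra token does not fit: the target
# slice is empty and NumPy raises
# "ValueError: could not broadcast input array from shape (1,) into shape (0,)".
token_ids_cpu[req_idx, start_idx:end_idx] = sampled_ids
```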
While this change correctly relaxes the assertion to allow for prefilling up to `max_model_len` and sampling one more token, it seems to introduce a potential IndexError in the subsequent line.

The buffer `self.input_batch.token_ids_cpu` is initialized with a size of `(max_num_reqs, max_model_len)` in `vllm/v1/worker/gpu_input_batch.py`. When `end_idx` is `self.max_model_len + 1`, the slice assignment `self.input_batch.token_ids_cpu[req_idx, start_idx:end_idx] = sampled_ids` will attempt to write to an out-of-bounds index (`max_model_len`).

To fix this, the `token_ids_cpu` buffer should probably be initialized with a size of `(max_num_reqs, max_model_len + 1)`. This change would be required in `vllm/v1/worker/gpu_input_batch.py`. A similar change might be needed for `vllm/v1/worker/tpu_input_batch.py` as well.
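Continuing the hypothetical NumPy sketch above, the suggested remedy would look roughly like this; it is illustrative only, and the real change would be to the buffer allocation in `vllm/v1/worker/gpu_input_batch.py`.

```python
import numpy as np

max_num_reqs, max_model_len = 4, 8

# Reserve one extra slot per request so a full-length prefill plus the single
# sampled token fits in the row.
token_ids_cpu = np.zeros((max_num_reqs, max_model_len + 1), dtype=np.int64)

# The write that overflowed in the earlier sketch now succeeds.
token_ids_cpu[0, max_model_len:max_model_len + 1] = [42]
```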
Not sure if the change in `vllm/v1/worker/tpu_input_batch.py` is needed.
Signed-off-by: Yannick Schnider <[email protected]>
@WoosukKwon @LucasWilkinson tagging you guys here as author/reviewer of #20291
Re-enable prefill of max model length
Purpose
Prior to #20291, it was possible to do a prefill of the max model length and sample a single token. However, #20291 introduced an assertion which no longer allows this.
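A minimal sketch of the old versus new bound, for illustration only; the real check is the assertion quoted in the review thread above, and `end_idx` here stands for the prompt tokens plus the newly sampled token of a request:

```python
max_model_len = 2048
prompt_len = max_model_len   # the prefill fills the entire context window
end_idx = prompt_len + 1     # plus the single sampled token

# Check introduced by #20291 (rejects a max-length prefill plus one token):
# assert end_idx <= max_model_len
# Relaxed check in this PR (accepts it):
assert end_idx <= max_model_len + 1
```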
This PR restores the original behavior. Note that this behavior (performing a prefill at the max model length and requesting a single token) is consistent with Hugging Face Transformers behavior.
Test Plan
Can be tested with any decoder model using a prompt of length `max_model_len` and setting `max_tokens=1`.
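For example, a minimal sketch of such a test; the model name, the 2048-token context, and the prompt construction are illustrative assumptions, not part of this PR:

```python
from vllm import LLM, SamplingParams

# Any decoder model works; facebook/opt-125m with a 2048-token context is used
# here purely for illustration.
llm = LLM(model="facebook/opt-125m", max_model_len=2048)

# Build a prompt that tokenizes to exactly max_model_len tokens.
tokenizer = llm.get_tokenizer()
prompt_token_ids = tokenizer.encode("Hello world! " * 4096)[:2048]

# Request a single token on top of the full-length prefill. Before this PR this
# request tripped the assertion; with this PR it should return one token.
outputs = llm.generate(
    [{"prompt_token_ids": prompt_token_ids}],
    SamplingParams(max_tokens=1),
)
print(outputs[0].outputs[0].token_ids)
```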
Test Result
e.g. before this PR, `max_model_len=2048` and `max_tokens=1` resulted in this warning:

After this PR, the assertion error does not show up and the correct output token is returned.