vllm/v1/worker/gpu_input_batch.py (2 changes: 1 addition & 1 deletion)

@@ -101,7 +101,7 @@ def __init__(
         # This buffer is not directly transferred to the GPU, so it does not
         # need to be pinned.
         self.token_ids_cpu_tensor = torch.zeros(
-            (max_num_reqs, max_model_len),
+            (max_num_reqs, max_model_len + 1),
             device="cpu",
             dtype=torch.int32,
             pin_memory=False,
vllm/v1/worker/gpu_model_runner.py (8 changes: 4 additions & 4 deletions)

@@ -1819,10 +1819,10 @@ def _bookkeeping_sync(

         start_idx = self.input_batch.num_tokens_no_spec[req_idx]
         end_idx = start_idx + len(sampled_ids)
-        assert end_idx <= self.max_model_len, (
-            "Sampled token IDs exceed the max model length. "
-            f"Total number of tokens: {end_idx} > max_model_len: "
-            f"{self.max_model_len}")
+        assert end_idx <= self.max_model_len + 1, (
+            "Sampled token IDs exceed the max model length + 1. "
+            f"Total number of tokens: {end_idx} > max_model_len + 1: "
+            f"{self.max_model_len + 1}")
Comment on lines +1822 to +1825

Contributor (critical):
While this change correctly relaxes the assertion to allow for prefilling up to max_model_len and sampling one more token, it seems to introduce a potential IndexError in the subsequent line.

The buffer self.input_batch.token_ids_cpu is initialized with a size of (max_num_reqs, max_model_len) in vllm/v1/worker/gpu_input_batch.py.

When end_idx is self.max_model_len + 1, the slice assignment self.input_batch.token_ids_cpu[req_idx, start_idx:end_idx] = sampled_ids will attempt to write to an out-of-bounds index (max_model_len).

To fix this, the token_ids_cpu buffer should probably be initialized with a size of (max_num_reqs, max_model_len + 1). This change would be required in vllm/v1/worker/gpu_input_batch.py. A similar change might be needed for vllm/v1/worker/tpu_input_batch.py as well.
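
To make the failure mode concrete, here is a minimal standalone sketch of the write described above, using plain NumPy rather than the vLLM classes; the names (token_ids_cpu, max_model_len, sampled_ids) mirror the diff, but the shapes are illustrative only. In plain NumPy the out-of-bounds slice surfaces as a broadcast ValueError rather than an IndexError, but either way the extra sampled token cannot be stored without the additional column:

# Illustrative sketch only; not the vLLM implementation.
import numpy as np

max_num_reqs, max_model_len = 4, 8
# Buffer shaped as it was before this PR: (max_num_reqs, max_model_len).
token_ids_cpu = np.zeros((max_num_reqs, max_model_len), dtype=np.int32)

req_idx = 0
start_idx = max_model_len               # the row is already full after prefill
sampled_ids = [42]                      # one newly sampled token
end_idx = start_idx + len(sampled_ids)  # == max_model_len + 1

# The relaxed assertion now allows end_idx == max_model_len + 1, but the
# slice [max_model_len:max_model_len + 1] is empty for a row of length
# max_model_len, so the assignment has nowhere to put the sampled token.
try:
    token_ids_cpu[req_idx, start_idx:end_idx] = sampled_ids
except ValueError as exc:
    print("out-of-bounds write:", exc)

# With one extra column, as this PR allocates, the same write succeeds.
token_ids_cpu = np.zeros((max_num_reqs, max_model_len + 1), dtype=np.int32)
token_ids_cpu[req_idx, start_idx:end_idx] = sampled_ids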

Contributor Author:
Not sure if the change in vllm/v1/worker/tpu_input_batch.py is needed.


         self.input_batch.token_ids_cpu[req_idx,
                                        start_idx:end_idx] = sampled_ids
vllm/v1/worker/tpu_input_batch.py (2 changes: 1 addition & 1 deletion)

@@ -44,7 +44,7 @@ def __init__(
         # This buffer is not directly transferred to the GPU, so it does not
         # need to be pinned.
         self.token_ids_cpu_tensor = torch.zeros(
-            (max_num_reqs, max_model_len),
+            (max_num_reqs, max_model_len + 1),
             device="cpu",
             dtype=torch.int32,
             pin_memory=False,