
Conversation

NickLucche (Collaborator) commented Sep 1, 2025

Fix #24006 by enabling proper batched audio_tower inference.
This is done by padding each audio feature sequence to the maximum sequence length in the batch.
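
Conceptually, the padding step looks like the following minimal sketch (illustrative only, not the exact vLLM implementation; `pad_audio_features` is a hypothetical name):

```python
import torch

def pad_audio_features(
    features: list[torch.Tensor],  # each tensor is (seq_len_i, num_features)
) -> tuple[torch.Tensor, torch.Tensor]:
    """Pad variable-length audio features to the batch's max sequence length."""
    max_len = max(f.shape[0] for f in features)
    num_feats = features[0].shape[-1]
    padded = features[0].new_zeros(len(features), max_len, num_feats)
    mask = torch.zeros(len(features), max_len, dtype=torch.bool)
    for i, f in enumerate(features):
        padded[i, : f.shape[0]] = f   # copy the real frames
        mask[i, : f.shape[0]] = True  # mark valid (non-padding) positions
    return padded, mask
```

The padded batch can then go through the audio tower in a single call, with the mask telling the model which positions are real frames.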

Thanks to @pratapyash for reporting the bug!

Test with

```
# vllm serve google/gemma-3n-E2B-it
python examples/online_serving/openai_chat_completion_client_for_multimodal.py -c multi-audio
```

Chat completion output from input audio: No, they are not the same. The first audio is of a sports broadcast announcing a baseball game. The second audio appears to be a recitation of an ancient poem, possibly related to mythology or religion, in Italian.

Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
mergify bot added the documentation label (Improvements or additions to documentation) on Sep 1, 2025
NickLucche (Collaborator, Author)

cc @DarkLight1337

```python
return dict(
    pixel_values=MultiModalFieldConfig.batched("image"),
    input_features=MultiModalFieldConfig.batched("audio"),
    input_features_mask=MultiModalFieldConfig.batched("audio"))
```
Member

Do we still need input_features in that case?

NickLucche (Collaborator, Author)

I definitely want to revisit that once I enable the processor test that required an hf transformers bump.
For now there's no big overhead at runtime because the unpadded tensor is just a view.
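
On the view point above, a quick standalone illustration (not the vLLM code): basic slicing in PyTorch returns a view that shares storage with the padded tensor, so keeping the unpadded features around costs no extra memory copy.

```python
import torch

padded = torch.zeros(4, 128, 80)  # (batch, max_seq_len, num_features)
unpadded = padded[0, :57]         # basic indexing/slicing returns a view

# The view shares the underlying storage with `padded`; nothing is copied.
assert unpadded.untyped_storage().data_ptr() == padded.untyped_storage().data_ptr()
```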

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request fixes an issue with batched audio inference for Gemma3n models by padding audio sequences. The core logic involves introducing a padded version of input_features for batched processing by the audio tower, while keeping an unpadded version for caching. The changes are generally good, but I've identified a critical issue with a .squeeze(1) call that will likely cause a crash, and a high-severity issue with an incorrect type hint.

```diff
 assert self.audio_tower is not None
-input_features = audio_input["input_features"].squeeze(1)
+# Run on padded features to enable batching
+input_features = audio_input["input_features_padded"].squeeze(1)
```
gemini-code-assist bot (Contributor)

critical

The use of .squeeze(1) here is likely incorrect and will cause a runtime error. input_features_padded is expected to have a shape of (batch_size, seq_length, num_features). Calling .squeeze(1) will only succeed if seq_length is 1, which is not generally the case for audio features. This seems to be a pre-existing issue, but since this line is modified, it's important to address it. The .squeeze(1) should probably be removed.

Suggested change

```diff
-input_features = audio_input["input_features_padded"].squeeze(1)
+input_features = audio_input["input_features_padded"]
```
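
For reference on the `.squeeze(1)` semantics under discussion, a standalone sketch (not vLLM code): PyTorch's `Tensor.squeeze(dim)` removes the dimension only when its size is 1, and otherwise returns the tensor unchanged rather than raising an error.

```python
import torch

x = torch.randn(2, 1, 80)
print(x.squeeze(1).shape)   # torch.Size([2, 80]): the size-1 dim is removed

y = torch.randn(2, 100, 80)
print(y.squeeze(1).shape)   # torch.Size([2, 100, 80]): unchanged, no error raised
```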

pratapyash (Contributor)

Thanks for the fix @NickLucche !

Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
DarkLight1337 (Member) left a comment


LGTM as long as tests still pass

DarkLight1337 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 1, 2025
NickLucche (Collaborator, Author)

@DarkLight1337 looks green

DarkLight1337 merged commit 0a74e9d into vllm-project:main on Sep 2, 2025
42 checks passed
akaihaoshuai pushed a commit to akaihaoshuai/vllm that referenced this pull request Sep 3, 2025
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Labels
documentation: Improvements or additions to documentation
ready: ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Gemma3n audio path crashes when input_features is a list not a Tensor.
3 participants