[Bugfix][CI][V1] Work around V1 + CUDA Graph + torch._scaled_mm fallback issue #13425
This PR works around an issue where, on vLLM V1 with Ada Lovelace GPUs and vLLM built against CUDA < 12.4, FP8 models with per-channel and/or per-tensor quantization produce garbage output.
Closes #13212
AFAICT, there is some issue with the way we are setting up `TORCH_DEVICE_IDENTITY`.
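For context, the fallback path keeps a module-level identity scale tensor that is lazily created and moved to the weight's device on first use. Below is a simplified sketch of that pattern and the suspected hazard; this is an illustration under my assumptions, not the exact vLLM code, and it uses the `torch._scaled_mm` keyword signature from PyTorch >= 2.5:

```python
import torch

# Module-level identity scale used by the torch._scaled_mm fallback.
TORCH_DEVICE_IDENTITY = None


def maybe_create_device_identity() -> None:
    # The workaround in this PR: callers run this before the forward
    # pass, so the tensor already exists before CUDA graph capture.
    global TORCH_DEVICE_IDENTITY
    if TORCH_DEVICE_IDENTITY is None:
        TORCH_DEVICE_IDENTITY = torch.ones(1, dtype=torch.float32)


def fallback_scaled_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a: fp8 activations (M, K); b: fp8 weights (K, N), column-major.
    global TORCH_DEVICE_IDENTITY
    assert TORCH_DEVICE_IDENTITY is not None, \
        "call maybe_create_device_identity() before the forward pass"
    # Suspected hazard: if the identity tensor is first moved to the
    # GPU here, inside CUDA graph capture, the allocation and H2D copy
    # get baked into the graph and replays can misbehave.
    if TORCH_DEVICE_IDENTITY.device != a.device:
        TORCH_DEVICE_IDENTITY = TORCH_DEVICE_IDENTITY.to(a.device)
    # Multiply with identity scales; the real per-tensor/per-channel
    # scales are applied to the output afterwards.
    return torch._scaled_mm(a, b,
                            scale_a=TORCH_DEVICE_IDENTITY,
                            scale_b=TORCH_DEVICE_IDENTITY,
                            out_dtype=torch.float16)
```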
Alternatively, now that `torch._scaled_mm` supports rowwise scaling, we could use that instead of the fallback (see the sketch below). Unfortunately, this only works if the model's dtype is bf16; otherwise we get an error.

I'm definitely not happy about the approach here as is, since it puts the onus on the caller of `apply_fp8_linear` to also call `maybe_create_device_identity` before the forward pass.
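For reference, a minimal sketch of the rowwise alternative (shapes and values are hypothetical; again assuming the PyTorch >= 2.5 `torch._scaled_mm` keyword signature). The commented-out call illustrates the dtype restriction mentioned above:

```python
import torch

M, K, N = 16, 64, 32
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
# The second operand must be fp8 and column-major.
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()

# Rowwise scaling: one scale per row of `a`, one per column of `b`.
scale_a = torch.ones(M, 1, device="cuda", dtype=torch.float32)
scale_b = torch.ones(1, N, device="cuda", dtype=torch.float32)

# Works: rowwise scales with a bf16 output dtype.
out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b,
                       out_dtype=torch.bfloat16)

# Raises a RuntimeError: rowwise scaling with a non-bf16 output dtype
# (e.g. when the model's dtype is fp16).
# out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b,
#                        out_dtype=torch.float16)
```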