[BugFix] Fix IMA FlashMLA full cuda-graph and DP + Update FlashMLA #21691
Conversation
Signed-off-by: Lucas Wilkinson <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request correctly addresses a potential issue with FlashMLA under full CUDA graph capture, especially in distributed settings. By moving the buffer allocation from a lazy, in-place approach to a pre-allocation strategy in the __init__ method, the code becomes more robust and compliant with CUDA graph requirements. The changes are logical and well-implemented. I have one suggestion to replace a magic number with a constant to improve long-term maintainability.
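For illustration, a minimal sketch of the pre-allocation pattern described above (the class name, buffer shape, and the MAX_NUM_SPLITS constant are assumptions for this sketch, not the PR's actual code):

```python
import torch

# Hypothetical named constant standing in for the magic number the
# review mentions; the real name and value are assumptions.
MAX_NUM_SPLITS = 128


class DecodeBuilderSketch:
    """Illustrative only: pre-allocate CUDA-graph buffers up front."""

    def __init__(self, max_num_reqs: int, device: torch.device):
        # Allocate once in __init__ so the buffer's address is fixed
        # before CUDA graph capture and stays stable across replays.
        self.cg_buf_num_splits = torch.zeros(
            max_num_reqs * MAX_NUM_SPLITS, dtype=torch.int32, device=device
        )

    def _build_decode(self, num_splits: int) -> torch.Tensor:
        # Return a view of the pre-allocated buffer; allocating lazily
        # here would create a new tensor during capture, which CUDA
        # graph replay cannot tolerate.
        assert num_splits <= self.cg_buf_num_splits.size(0)
        return self.cg_buf_num_splits[:num_splits]
```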
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Force-pushed from cfe7b52 to 7492f99
Signed-off-by: Lucas Wilkinson <[email protected]>
Force-pushed from 7492f99 to 1035a64
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
vllm serve deepseek-ai/DeepSeek-V2-Lite --port 9256 --enable-expert-parallel --data-parallel-size 2 --trust-remote-code -O '{"full_cuda_graph": true}' --cuda-graph-sizes 16 32 64 128 256 512
Originally:
(EngineCore_0 pid=94527) AssertionError
(EngineCore_1 pid=94528) answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_1 pid=94528) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_1 pid=94528) File "/home/wentao/vllm/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_1 pid=94528) return func(*args, **kwargs)
(EngineCore_1 pid=94528) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_1 pid=94528) File "/home/wentao/vllm/vllm/v1/worker/gpu_worker.py", line 330, in compile_or_warm_up_model
(EngineCore_1 pid=94528) self.model_runner._dummy_run(
(EngineCore_1 pid=94528) File "/home/wentao/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_1 pid=94528) return func(*args, **kwargs)
(EngineCore_1 pid=94528) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_1 pid=94528) File "/home/wentao/vllm/vllm/v1/worker/gpu_model_runner.py", line 2206, in _dummy_run
(EngineCore_1 pid=94528) .build_for_cudagraph_capture(common_attn_metadata)
(EngineCore_1 pid=94528) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_1 pid=94528) File "/home/wentao/vllm/vllm/v1/attention/backends/mla/common.py", line 580, in build_for_cudagraph_capture
(EngineCore_1 pid=94528) return self.build(0, m)
(EngineCore_1 pid=94528) ^^^^^^^^^^^^^^^^
(EngineCore_1 pid=94528) File "/home/wentao/vllm/vllm/v1/attention/backends/mla/common.py", line 705, in build
(EngineCore_1 pid=94528) decode_metadata = self._build_decode(
(EngineCore_1 pid=94528) ^^^^^^^^^^^^^^^^^^^
(EngineCore_1 pid=94528) File "/home/wentao/vllm/vllm/v1/attention/backends/mla/flashmla.py", line 100, in _build_decode
(EngineCore_1 pid=94528) assert n <= self.cg_buf_num_splits.size(0)
(EngineCore_1 pid=94528) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_1 pid=94528) AssertionError
Now:
(APIServer pid=126325) INFO: Started server process [126325]
(APIServer pid=126325) INFO: Waiting for application startup.
(APIServer pid=126325) INFO: Application startup complete.
So I think this PR fixed the issue, thanks for the work! @tlrmchlsmth Could you trigger CI?
LGTM
…llm-project#21691) Signed-off-by: Lucas Wilkinson <[email protected]> Co-authored-by: yewentao256 <[email protected]> Co-authored-by: Wentao Ye <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
…llm-project#21691) Signed-off-by: Lucas Wilkinson <[email protected]> Co-authored-by: yewentao256 <[email protected]> Co-authored-by: Wentao Ye <[email protected]> Signed-off-by: Noam Gat <[email protected]>
…llm-project#21691) Signed-off-by: Lucas Wilkinson <[email protected]> Co-authored-by: yewentao256 <[email protected]> Co-authored-by: Wentao Ye <[email protected]> Signed-off-by: Paul Pak <[email protected]>
…llm-project#21691) Signed-off-by: Lucas Wilkinson <[email protected]> Co-authored-by: yewentao256 <[email protected]> Co-authored-by: Wentao Ye <[email protected]> Signed-off-by: Diego-Castan <[email protected]>
…llm-project#21691) Signed-off-by: Lucas Wilkinson <[email protected]> Co-authored-by: yewentao256 <[email protected]> Co-authored-by: Wentao Ye <[email protected]>
…llm-project#21691) Signed-off-by: Lucas Wilkinson <[email protected]> Co-authored-by: yewentao256 <[email protected]> Co-authored-by: Wentao Ye <[email protected]> Signed-off-by: Xiao Yu <[email protected]>
Essential Elements of an Effective PR Description Checklist
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
Purpose
Merge vllm-project/FlashMLA#3 first.
Fix an IMA (illegal memory access) that occurs when using FlashMLA with full CUDA graphs and wide-EP.
Also updates FlashMLA (i.e., #17027), since the FlashMLA changes were made on top of that. #17027 was back-burnered because it showed a slight slowdown in the TP attention case, but it should provide a speedup for DP attention.
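As a rough illustration of the failure mode (one plausible reading of the traceback above, not the PR's actual code): if the capture buffer is sized lazily from the first batch it sees, a later CUDA graph capture at a larger size overruns it, tripping the assertion.

```python
import torch


class LazyBuilderSketch:
    """Illustrative anti-pattern: buffer sized on first use."""

    def _build_decode(self, num_splits: int) -> torch.Tensor:
        if not hasattr(self, "cg_buf_num_splits"):
            # Sized for whatever the first call happens to need...
            self.cg_buf_num_splits = torch.zeros(
                num_splits, dtype=torch.int32, device="cuda"
            )
        # ...so a later, larger capture size fails this check (or, with
        # no check, reads out of bounds -> illegal memory access).
        assert num_splits <= self.cg_buf_num_splits.size(0)
        return self.cg_buf_num_splits[:num_splits]
```

Pre-allocating in __init__ to the maximum capture size, as this PR does, removes that dependence on call ordering.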
Test Plan
The test was failing on an llm-d benchmark.
Test Result
Fixes the llm-d benchmark failure.
(Optional) Documentation Update