[Core] Support full cuda graph in v1 #16072
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
@chanh thanks for the PR, I have tested llama 8b on my side with your PR and I see ~7% improvement for TPOT. Great work! Before PR:
After PR:
just a heads up @zou3519
Thanks to @alexm-redhat for verifying!
@chanh went over the PR in detail, looks really good. Left some comments. Thanks for adding the test; I think it can be expanded to cover CUDA graph edge cases a bit better.
@chanh tell me if you need help with extending the tests, I can do it on my side.
Thanks for the PR! I will review it this weekend (maybe Tyler and Rob, too).
I ran some latency-focused testing on this PR using LLaMA 3.2 1B Instruct with a small batch size (~1-2) in a highly latency-constrained setting where minimizing CUDA graph launches can significantly improve GPU utilization. Here are the results: Before PR:
After PR:
This shows a notable improvement across the board, particularly in tail latencies. Great work!
@chanh Thanks for pushing this through!
I think we may need to disable ahead-of-time scheduling for FA3 when using full cuda-graph (see vllm/v1/attention/backends/flash_attn.py, lines 341 to 354 at 1a6af14), since this scheduler may choose a different number of splits than what the graph was captured with.
Do we have lm-eval accuracy results with full cuda-graphs on?
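A minimal sketch of the guard being discussed, under assumed names (`aot_schedule`, `fa_version`, `full_cuda_graph` are illustrative, not the exact code in flash_attn.py):

```python
# Sketch only: attribute and flag names are assumptions for illustration,
# not the exact implementation in vllm/v1/attention/backends/flash_attn.py.

class AttentionMetadataBuilderSketch:
    def __init__(self, fa_version: int, full_cuda_graph: bool):
        # FA3 can plan the number of KV splits ahead of time, but a captured
        # full CUDA graph always replays the split count it was recorded with,
        # so the ahead-of-time schedule must be turned off in that mode to
        # avoid a mismatch at replay time.
        self.aot_schedule = (fa_version == 3) and not full_cuda_graph
```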
Will discuss with you over Slack
Okay, disabled it for now.
lm-eval results
LGTM thanks!
What is special about local attention?
@chanh It seems that full cuda graph support outputs garbage on latest main. Do you have any idea?
@chanh +1 - it seems like the test was never added to CI (needs to be added manually to …)
with set_forward_context(None,
with set_forward_context(attn_metadata,
Considering that self.maybe_setup_kv_connector(scheduler_output) is not executed here, in the full CUDA graph scenario the sequence unified_attention_with_output -> maybe_save_kv_layer_to_connector -> connector.save_kv_layer() will cause the connector to read uninitialized metadata. Therefore, should full CUDA graph be considered incompatible with the KV connector?
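One way such an incompatibility could be surfaced at configuration time, as a hedged sketch (the function and field names below are illustrative, not vLLM's actual API):

```python
from typing import Optional

# Illustrative check only; names are assumptions for the sketch.
def validate_full_cuda_graph(full_cuda_graph: bool,
                             kv_connector: Optional[object]) -> None:
    # Inside a captured graph the per-step maybe_setup_kv_connector() call is
    # skipped, so connector.save_kv_layer() would read stale or uninitialized
    # metadata. Reject the combination up front instead of failing at runtime.
    if full_cuda_graph and kv_connector is not None:
        raise ValueError(
            "Full CUDA graph capture is not supported together with a "
            "KV connector; disable one of the two.")
```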
Hello, I have changed the code to capture the full CUDA graph with FA2. The results show that FA2 also works correctly.
@WoosukKwon This may be helpful. Regarding FA2, FlashInfer recently merged a PR (flashinfer-ai/flashinfer#1137) that implements a persistent-style FA2 template. It unifies prefill and decode, which supports a single CUDA graph for all batch sizes and sequence lengths.
Summary
Support capturing a single CUDA graph for the entire model's forward pass, instead of piecewise graphs. This requires creating persistent buffers to make attention graphable. Credit to @tlrmchlsmth for the original implementation.
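A minimal sketch of the persistent-buffer idea described above; buffer names and shapes are illustrative, not the PR's exact implementation:

```python
import torch

# Sketch: attention metadata is written into fixed, pre-allocated buffers so
# that the captured CUDA graph always reads from the same device pointers.
class PersistentAttentionBuffers:
    def __init__(self, max_num_reqs: int, max_blocks_per_req: int,
                 device: str = "cuda"):
        # Allocated once before capture and reused for every replay.
        self.block_table = torch.zeros(max_num_reqs, max_blocks_per_req,
                                       dtype=torch.int32, device=device)
        self.seq_lens = torch.zeros(max_num_reqs, dtype=torch.int32,
                                    device=device)

    def update(self, block_table: torch.Tensor, seq_lens: torch.Tensor) -> None:
        # Copy the current step's metadata into the persistent buffers in
        # place; the graph itself is never re-captured.
        n, b = block_table.shape
        self.block_table[:n, :b].copy_(block_table)
        self.seq_lens[:n].copy_(seq_lens)
```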
Limitations:
Work in progress:
This reduces median TPOT by 7% for small models like Qwen 2.5 1.5B.
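For reference, a hedged example of how this mode would typically be switched on through the compilation config; the `full_cuda_graph` flag name is taken from this PR's intent, and the exact API may differ on current main:

```python
from vllm import LLM

# Assumes the option added by this PR is exposed as
# CompilationConfig.full_cuda_graph; adjust to the merged API if it differs.
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    compilation_config={"full_cuda_graph": True},
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```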
Before
With piecewise graphs, there are multiple kernel launches per layer, with more gaps between kernel executions (13 ms to decode one token in profiling mode):

After
There is now a single CUDA graph launch, with almost no gaps between kernel executions (6 ms to decode one token in profiling mode):

** Above benchmarks performed with: