
Conversation

chanh
Contributor

@chanh chanh commented Apr 4, 2025

Summary

Support capturing a single CUDA graph for the entire model's forward pass, instead of piecewise graphs. This requires creating persistent buffers to make attention graphable. Credit to @tlrmchlsmth for the original implementation.
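
For readers unfamiliar with the pattern, below is a minimal, generic sketch of full-graph capture with persistent buffers (illustrative only, not vLLM's actual implementation; the model, buffer names, and shapes are placeholders):

import torch

@torch.inference_mode()
def capture_full_graph(model, max_tokens, hidden_size, device="cuda"):
    # Persistent buffers: graph replay reads/writes these exact tensors,
    # so inputs (and attention metadata) must live in fixed storage.
    static_input = torch.zeros(max_tokens, hidden_size, device=device)
    static_output = torch.empty_like(static_input)

    # Warm up on a side stream before capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        static_output.copy_(model(static_input))
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output.copy_(model(static_input))
    return graph, static_input, static_output

def replay(graph, static_input, static_output, new_input):
    # Copy fresh data into the persistent buffer, then replay the graph:
    # one launch for the whole forward pass instead of one per layer/op.
    n = new_input.shape[0]
    static_input[:n].copy_(new_input)
    graph.replay()
    return static_output[:n]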

Limitations:

  1. This only works with V1 + FA3, since FA2 currently is not graphable due to an optimization for GQA.
  2. This doesn't work with Cascade Attention.

Work in progress:

  1. Investigating changes needed to make this work with Llama4 / local attention

This reduces median TPOT by 7% for small models like Qwen 2.5 1.5B.

Before

With piecewise graphs, there are multiple kernel launches per layer, with more gaps between kernel executions (13 ms to decode one token in profiling mode):
[Screenshot: profiler trace, piecewise CUDA graphs]

============ Serving Benchmark Result ============
Successful requests:                     100       
Benchmark duration (s):                  103.15    
Total input tokens:                      100000    
Total generated tokens:                  10000     
Request throughput (req/s):              0.97      
Output token throughput (tok/s):         96.95     
Total Token throughput (tok/s):          1066.46   
---------------Time to First Token----------------
Mean TTFT (ms):                          29.08     
Median TTFT (ms):                        28.89     
P99 TTFT (ms):                           36.17     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.75      
Median TPOT (ms):                        5.75      
P99 TPOT (ms):                           6.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.75      
Median ITL (ms):                         5.70      
P99 ITL (ms):                            6.58      
==================================================

After

There is now a single launch for the whole forward pass, with almost no gaps between kernel executions (6 ms to decode one token in profiling mode):
[Screenshot: profiler trace, full CUDA graph]

============ Serving Benchmark Result ============
Successful requests:                     100       
Benchmark duration (s):                  103.10    
Total input tokens:                      100000    
Total generated tokens:                  10000     
Request throughput (req/s):              0.97      
Output token throughput (tok/s):         96.99     
Total Token throughput (tok/s):          1066.92   
---------------Time to First Token----------------
Mean TTFT (ms):                          29.52     
Median TTFT (ms):                        30.47     
P99 TTFT (ms):                           39.97     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.31      
Median TPOT (ms):                        5.33      
P99 TPOT (ms):                           5.56      
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.31      
Median ITL (ms):                         5.27      
P99 ITL (ms):                            6.18      
==================================================

Above benchmarks were performed with:

VLLM_FLASH_ATTN_VERSION=3 VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-1.5B-Instruct  --enable-prefix-caching --dtype float16 --disable-log-requests -O3 (or -O4)

vllm bench serve \
        --model Qwen/Qwen2.5-1.5B-Instruct \
        --request-rate 1 \
        --num-prompts 100 \
        --random-input-len 1000 \
        --random-output-len 100 \
        --tokenizer Qwen/Qwen2.5-1.5B-Instruct \
        --ignore-eos


github-actions bot commented Apr 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Apr 4, 2025
@mgoin mgoin requested a review from tlrmchlsmth April 4, 2025 21:34
@WoosukKwon WoosukKwon self-assigned this Apr 4, 2025
Chanh Nguyen added 2 commits April 7, 2025 20:57
Signed-off-by: Chanh Nguyen <[email protected]>
Signed-off-by: Chanh Nguyen <[email protected]>

mergify bot commented Apr 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chanh.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 8, 2025
Chanh Nguyen added 2 commits April 8, 2025 07:38
Signed-off-by: Chanh Nguyen <[email protected]>
@mergify mergify bot added ci/build and removed needs-rebase labels Apr 8, 2025
@chanh chanh marked this pull request as ready for review April 8, 2025 08:44
@alexm-redhat
Collaborator

Thanks for the PR, @chanh. I tested Llama 8B on my side with your PR and see a ~7% improvement in TPOT. Great work!

Before PR:

============ Serving Benchmark Result ============
Successful requests:                     50        
Benchmark duration (s):                  45.05     
Total input tokens:                      25600     
Total generated tokens:                  12800     
Request throughput (req/s):              1.11      
Output token throughput (tok/s):         284.11    
Total Token throughput (tok/s):          852.34    
---------------Time to First Token----------------
Mean TTFT (ms):                          22.43     
Median TTFT (ms):                        22.10     
P99 TTFT (ms):                           27.48     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.63      
Median TPOT (ms):                        7.63      
P99 TPOT (ms):                           7.77      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.63      
Median ITL (ms):                         7.61      
P99 ITL (ms):                            8.45      
==================================================

After PR:

============ Serving Benchmark Result ============
Successful requests:                     50        
Benchmark duration (s):                  44.93     
Total input tokens:                      25600     
Total generated tokens:                  12800     
Request throughput (req/s):              1.11      
Output token throughput (tok/s):         284.88    
Total Token throughput (tok/s):          854.64    
---------------Time to First Token----------------
Mean TTFT (ms):                          22.72     
Median TTFT (ms):                        22.93     
P99 TTFT (ms):                           27.49     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.14      
Median TPOT (ms):                        7.14      
P99 TPOT (ms):                           7.28      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.14      
Median ITL (ms):                         7.12      
P99 ITL (ms):                            8.05      
==================================================

@sarckk
Collaborator

sarckk commented Apr 9, 2025

Work in progress:

  1. Investigating changes needed to make this work with Llama4 / local attention

just a heads up @zou3519

@chanh
Contributor Author

chanh commented Apr 9, 2025

Thanks for the PR, @chanh. I tested Llama 8B on my side with your PR and see a ~7% improvement in TPOT. Great work!


Thanks to @alexm-redhat for verifying!

Collaborator

@alexm-redhat alexm-redhat left a comment


@chanh went over the PR in detail; it looks really good. Left some comments. Thanks for adding the test; I think it can be expanded to cover CUDA graph edge cases a bit better.

@mgoin mgoin self-requested a review April 11, 2025 14:54
@alexm-redhat
Collaborator

@chanh let me know if you need help extending the tests; I can do it on my side.

@WoosukKwon
Collaborator

Thanks for the PR! I will review it this weekend (maybe Tyler and Rob, too).

@dblincoe

I ran some latency-focused testing on this PR using LLaMA 3.2 1B Instruct with a small batch size (~1-2) in a highly latency-constrained setting where minimizing CUDA graph launches can significantly improve GPU utilization. Here are the results:

Before PR:

Average latency: 56.82 ms
p50 latency: 53.00 ms
p90 latency: 64.00 ms
p95 latency: 68.00 ms
p99 latency: 82.23 ms

After PR:

Average latency: 50.30 ms
p50 latency: 48.00 ms
p90 latency: 58.00 ms
p95 latency: 61.00 ms
p99 latency: 67.00 ms

This shows a notable improvement across the board, particularly in tail latencies. Great work!

@mergify mergify bot removed the needs-rebase label May 7, 2025
@chanh chanh requested a review from WoosukKwon May 7, 2025 11:26
@tlrmchlsmth
Member

@chanh Thanks for pushing this through!

@LucasWilkinson
Collaborator

I think we may need to disable ahead-of-time scheduling for FA3 when using full cuda-graph:

if self.aot_schedule:
    return get_scheduler_metadata(
        batch_size=batch_size,
        max_seqlen_q=max_query_len,
        max_seqlen_k=max_seq_len,
        cache_seqlens=seqlens,
        num_heads_q=self.num_heads_q,
        num_heads_kv=self.num_heads_kv,
        headdim=self.headdim,
        page_size=self.page_size,
        cu_seqlens_q=cu_query_lens,
        causal=causal,
        window_size=self.aot_sliding_window,
    )

since this scheduler may choose a different number of splits than what the graph was captured with

do we have lm-eval accuracy results with full cuda-graphs on?

@chanh
Contributor Author

chanh commented May 7, 2025

I think we may need to disable ahead-of-time scheduling for FA3 when using full cuda-graph: […] do we have lm-eval accuracy results with full cuda-graphs on?

Will discuss with you over Slack

Signed-off-by: Chanh Nguyen <[email protected]>
@chanh
Contributor Author

chanh commented May 7, 2025

I think we may need to disable ahead-of-time scheduling for FA3 when using full cuda-graph: […]

Okay disabled it for now.
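
For reference, a minimal sketch of what such gating could look like (illustrative only; it assumes a full_cuda_graph flag reachable from the compilation config and reuses the attributes from the snippet quoted above, which may not match the PR exactly):

# Skip FA3's ahead-of-time scheduler when the whole model runs as one CUDA
# graph, since the AOT scheduler may pick a split count different from the
# one the graph was captured with. Flag/attribute paths here are assumptions.
if self.aot_schedule and not self.compilation_config.full_cuda_graph:
    scheduler_metadata = get_scheduler_metadata(
        batch_size=batch_size,
        max_seqlen_q=max_query_len,
        max_seqlen_k=max_seq_len,
        cache_seqlens=seqlens,
        num_heads_q=self.num_heads_q,
        num_heads_kv=self.num_heads_kv,
        headdim=self.headdim,
        page_size=self.page_size,
        cu_seqlens_q=cu_query_lens,
        causal=causal,
        window_size=self.aot_sliding_window,
    )
else:
    scheduler_metadata = None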

fix
Signed-off-by: Chanh Nguyen <[email protected]>
@chanh
Contributor Author

chanh commented May 7, 2025


lm-eval results

[Current branch, Full CUDA Graph flag enabled, modified lm-eval to pass the compilation_config JSON properly to vLLM]
VLLM_FLASH_ATTN_VERSION=3 VLLM_USE_V1=1 \
lm_eval --model vllm \
  --model_args "pretrained=Qwen/Qwen2-1.5B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1,compilation_config={\"full_cuda_graph\": true}" \
  --tasks gsm8k

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5982|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.5898|±  |0.0135|


[Main branch]
VLLM_FLASH_ATTN_VERSION=3 VLLM_USE_V1=1 \
lm_eval --model vllm \
  --model_args "pretrained=Qwen/Qwen2-1.5B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=1" \
  --tasks gsm8k

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5951|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.5891|±  |0.0136|

Collaborator

@LucasWilkinson LucasWilkinson left a comment


LGTM thanks!

@simon-mo simon-mo merged commit 7ea2adb into vllm-project:main May 8, 2025
51 checks passed
princepride pushed a commit to princepride/vllm that referenced this pull request May 10, 2025
Signed-off-by: Chanh Nguyen <[email protected]>
Co-authored-by: Chanh Nguyen <[email protected]>
Signed-off-by: 汪志鹏 <[email protected]>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Chanh Nguyen <[email protected]>
Co-authored-by: Chanh Nguyen <[email protected]>
Signed-off-by: Mu Huai <[email protected]>
@renjie0

renjie0 commented May 13, 2025

Work in progress:

  1. Investigating changes needed to make this work with Llama4 / local attention

just a heads up @zou3519

What is special about local attention?

mawong-amd pushed a commit to ROCm/vllm that referenced this pull request May 14, 2025
Signed-off-by: Chanh Nguyen <[email protected]>
Co-authored-by: Chanh Nguyen <[email protected]>
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Signed-off-by: Chanh Nguyen <[email protected]>
Co-authored-by: Chanh Nguyen <[email protected]>
Signed-off-by: Yuqi Zhang <[email protected]>
@Juelianqvq
Contributor

@chanh It seems that full CUDA graph support outputs garbage on the latest main. Do you have any idea?

@ProExpertProg
Collaborator

@chanh +1 - it seems like the test was never added to CI (needs to be added manually to .buildkite/test-pipeline.yml). When I run the test locally, the first shape works and all the other shapes output garbage.

- with set_forward_context(None,
+ with set_forward_context(attn_metadata,
Contributor

@hidva hidva Jun 11, 2025


Considering that self.maybe_setup_kv_connector(scheduler_output) is not executed here, in the full CUDA graph scenario the call chain unified_attention_with_output -> maybe_save_kv_layer_to_connector -> connector.save_kv_layer() will cause the connector to read uninitialized metadata.

https://github.com/LMCache/LMCache/blob/680fbdf84e2ee1040bf4e084d43c9155a91b8d5c/lmcache/integration/vllm/vllm_v1_adapter.py#L609-L610

Does this mean full CUDA graph is incompatible with the KV connector?

@simon-mo
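
For context, a rough sketch of what the highlighted change above implies during capture (simplified; the config argument and persistent buffer names are assumptions, not necessarily vLLM's exact code):

# With full CUDA graph, real attention metadata (backed by persistent buffers)
# must be visible via the forward context while capturing, instead of None.
# As a side effect, anything hooked into the attention call (e.g. KV-connector
# save paths) also runs during capture/replay, which is the concern raised above.
with set_forward_context(attn_metadata, self.vllm_config):
    hidden_states = self.model(
        input_ids=self.input_ids[:num_tokens],
        positions=self.positions[:num_tokens],
    )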

@Lmywl

Lmywl commented Jun 19, 2025

  1. This only works with V1 + FA3, since FA2 currently is not graphable due to an optimization for GQA.

Hello, I changed the code to enable full CUDA graph capture with FA2, and the results show that FA2 also works correctly.
So I'm curious what "FA2 currently is not graphable due to an optimization for GQA" specifically refers to.

@happierpig

@WoosukKwon This may be helpful. Regarding FA2, FlashInfer (flashinfer-ai/flashinfer#1137) recently merged a PR that implements a persistent-style FA2 template. That PR unifies prefill and decode, which supports a single CUDA graph for all batch sizes and sequence lengths.

@xsank
Contributor

xsank commented Aug 28, 2025

@chanh #23739, do you have any idea about this problem?
