[V1] LoRA - Add triton kernels for V1 #13096

varun-sundar-rabindranath · 2025-02-11T16:03:56Z

Add shrink and expand triton kernels for V1.

Why do we need a new set of kernels:

V0 sorts/groups requests by LoRA ID. The SGMV kernels take advantage of this and group the compute within thread blocks.
V1 doesn't group requests by LoRA ID. The new kernels are given information about which input tokens map to which LoRA ID, and they use it to load the appropriate input tokens. The rest of the matmul is very similar to the SGMV kernels.
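To make the token-to-LoRA mapping concrete, here is a minimal pure-Python sketch of the kind of metadata such a kernel could consume. It is an illustration only: the function names, the `-1` "no LoRA" convention, and the exact layout are assumptions, not the PR's actual tensors.

```python
from itertools import accumulate


def build_lora_token_metadata(token_lora_ids, max_loras):
    """Group token positions by LoRA ID without reordering the batch itself.

    token_lora_ids[i] is the LoRA slot used by token i, or -1 for "no LoRA"
    (hypothetical convention for this sketch).
    """
    # Stable sort of token *positions* by LoRA ID; the input rows stay in place.
    order = sorted(range(len(token_lora_ids)), key=lambda i: token_lora_ids[i])
    # Tokens without a LoRA need no LoRA matmul, so drop them.
    order = [i for i in order if token_lora_ids[i] >= 0]
    # Number of tokens per LoRA slot.
    counts = [0] * max_loras
    for i in order:
        counts[token_lora_ids[i]] += 1
    # Offset of each LoRA's first entry inside `order` (exclusive prefix sum).
    starts = [0] + list(accumulate(counts))[:-1]
    return order, counts, starts


def tokens_for_lora(lora_id, order, counts, starts):
    """Row indices of the input (A matrix) that a given LoRA must load."""
    begin = starts[lora_id]
    return order[begin:begin + counts[lora_id]]


token_lora_ids = [1, 0, 1, -1, 0, 2]
order, counts, starts = build_lora_token_metadata(token_lora_ids, max_loras=4)
rows_for_lora_1 = tokens_for_lora(1, order, counts, starts)  # rows 0 and 2
```

With metadata like this, a program handling LoRA 1 gathers rows 0 and 2 directly from the unsorted input instead of relying on the batch being grouped.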

Kernel Code Change:
The new kernels reuse much of the code from the existing SGMV kernels. The main changes are:

Kernel launch grid formulation (required so the kernels are CUDA Graph compatible; note that the SGMV kernels are not).
Loading of the input tokens (the A matrix) for the matmul.

All other kernel code is the same as in the existing SGMV kernels. I refactored that code so it can be reused.
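The description does not spell out the new grid formulation. Since a captured CUDA Graph replays a fixed launch configuration, one common approach is to derive the grid only from static upper bounds and let programs without work exit early. The sketch below models that idea in plain Python; the function names, block sizes, and grid layout are assumptions, not the PR's kernel code.

```python
import math


def static_launch_grid(max_num_tokens, out_features, max_loras,
                       block_m=16, block_n=32):
    """Grid computed from fixed upper bounds only, never from the live batch."""
    tiles_m = math.ceil(max_num_tokens / block_m)
    tiles_n = math.ceil(out_features / block_n)
    return (tiles_m * tiles_n, max_loras)


def program_has_work(pid_mn, lora_slot, tokens_per_lora, out_features,
                     block_m=16, block_n=32):
    """Mimic the early-exit check a kernel program would do at its start."""
    tiles_n = math.ceil(out_features / block_n)
    tile_m = pid_mn // tiles_n
    # The tile is live only if it starts before the end of this LoRA's tokens.
    return tile_m * block_m < tokens_per_lora[lora_slot]


grid = static_launch_grid(max_num_tokens=64, out_features=64, max_loras=4)
tokens_per_lora = [20, 0, 3, 0]  # changes per batch; the grid does not
active = sum(
    program_has_work(pid, slot, tokens_per_lora, 64)
    for slot in range(grid[1])
    for pid in range(grid[0])
)
```

The trade-off in this formulation is that some launched programs do nothing, in exchange for a launch shape that stays identical across batches.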

Serving benchmark numbers:

server command : VLLM_USE_V1="1" vllm serve meta-llama/Llama-2-7b-hf --max-loras 4 --max-lora-rank 8 --enable-lora --lora-modules lora1=yard1/llama-2-7b-sql-lora-test lora2=yard1/llama-2-7b-sql-lora-test lora3=yard1/llama-2-7b-sql-lora-test lora4=yard1/llama-2-7b-sql-lora-test --no-enable-prefix-caching

benchmark command : python3 benchmarks/benchmark_serving.py --model meta-llama/Llama-2-7b-hf --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 500 --request-rate inf --lora-modules lora1 lora2 lora3 lora4

V1 LoRA - This PR:

============ Serving Benchmark Result ============
Successful requests:                     500       
Benchmark duration (s):                  68.81     
Total input tokens:                      117316    
Total generated tokens:                  110942    
Request throughput (req/s):              7.27      
Output token throughput (tok/s):         1612.20   
Total Token throughput (tok/s):          3317.02   
---------------Time to First Token----------------
Mean TTFT (ms):                          7103.84   
Median TTFT (ms):                        7136.82   
P99 TTFT (ms):                           14040.12  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          126.93    
Median TPOT (ms):                        94.96     
P99 TPOT (ms):                           231.31    
---------------Inter-token Latency----------------
Mean ITL (ms):                           76.47     
Median ITL (ms):                         58.02     
P99 ITL (ms):                            236.28    
==================================================

V1 LoRA - Main:

============ Serving Benchmark Result ============
Successful requests:                     500       
Benchmark duration (s):                  117.84    
Total input tokens:                      117316    
Total generated tokens:                  110942    
Request throughput (req/s):              4.24      
Output token throughput (tok/s):         941.44    
Total Token throughput (tok/s):          1936.96   
---------------Time to First Token----------------
Mean TTFT (ms):                          10277.45  
Median TTFT (ms):                        9370.56   
P99 TTFT (ms):                           22882.99  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          259.87    
Median TPOT (ms):                        236.82    
P99 TPOT (ms):                           445.65    
---------------Inter-token Latency----------------
Mean ITL (ms):                           173.39    
Median ITL (ms):                         164.76    
P99 ITL (ms):                            459.73    
==================================================
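For a quick comparison of the two runs above, the ratios can be computed directly from the reported numbers. The snippet below only copies values from the two result blocks; the dictionary keys are ad hoc names chosen for this sketch.

```python
# Values copied from the two benchmark result blocks above.
this_pr = {
    "duration_s": 68.81,
    "total_tok_per_s": 3317.02,
    "mean_ttft_ms": 7103.84,
    "mean_tpot_ms": 126.93,
    "mean_itl_ms": 76.47,
}
main = {
    "duration_s": 117.84,
    "total_tok_per_s": 1936.96,
    "mean_ttft_ms": 10277.45,
    "mean_tpot_ms": 259.87,
    "mean_itl_ms": 173.39,
}

# Higher is better for throughput, so divide this PR by main.
throughput_gain = this_pr["total_tok_per_s"] / main["total_tok_per_s"]

# Lower is better for durations and latencies, so divide main by this PR.
latency_gains = {
    key: main[key] / this_pr[key]
    for key in ("duration_s", "mean_ttft_ms", "mean_tpot_ms", "mean_itl_ms")
}

print(f"total token throughput: {throughput_gain:.2f}x")
for key, gain in latency_gains.items():
    print(f"{key}: {gain:.2f}x")
```

On these numbers, total token throughput improves by roughly 1.7x, and mean TPOT and ITL improve by a little over 2x.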

Kernel micro benchmark:

Please find kernel microbenchmark here - https://docs.google.com/spreadsheets/d/1b_8KsDGdiSGWlHODMszug_-do7OlSPlSzoV84VkmVPc/edit?usp=sharing (sheet : "V1 : Dont Sort Tokens By LoRA ")

Note: The V0 SGMV and BGMV kernels are not tuned, but the V1 kernels are tuned with the Triton autotuner. The gap between the V1 and SGMV/BGMV kernels could therefore be partially explained by the tuning.
The SGMV kernel depends heavily on the input being sorted. The V1 kernels aren't affected as much.

github-actions · 2025-02-11T16:04:10Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

varun-sundar-rabindranath · 2025-02-20T14:45:18Z

vllm/lora/ops/triton_ops/v1/utils.py

@@ -0,0 +1,102 @@
+# SPDX-License-Identifier: Apache-2.0
+


Requesting reviews on this file. The utilities in this file deal with loading the stored triton configs.
cc @tlrmchlsmth @mgoin

QQ: Is there a significant gain from storing tuned configs?

Not significant - but I do see some gains. On average I see about 200 tokens/s more throughput when using the tuned configs.

TBH, storing these configs is a bit crazy, I'm not sure if this is the right direction.

It appears to be common to have these configs for Triton kernels. The fused_moe kernels also have such configs here: https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/fused_moe/configs

Arguments for storing configs:

The default/static config isn't going to be optimal for all cases.

Generating these configs is easy. With new GPUs and Triton version updates we can simply run the triton autotuner and drop in these configs without having to do any code change.
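As an illustration of the "drop in a config file, no code change" argument, here is a minimal sketch of a loader with a static fallback. The JSON layout (token count mapped to kernel parameters), the file name, the parameter names, and the nearest-key lookup are all assumptions made for this example, not the contents of utils.py.

```python
import functools
import json
import os
import tempfile

# Hypothetical static fallback; the values are placeholders, not tuned numbers.
DEFAULT_CONFIG = {"BLOCK_M": 16, "BLOCK_N": 32, "BLOCK_K": 32, "num_warps": 4}


@functools.lru_cache(maxsize=None)
def load_tuned_configs(path):
    """Read a stored config file once; return None if it does not exist."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        raw = json.load(f)
    # JSON keys are strings; convert them back to integer token counts.
    return {int(num_tokens): cfg for num_tokens, cfg in raw.items()}


def pick_config(path, num_tokens):
    """Use the stored config tuned for the closest token count, if any."""
    configs = load_tuned_configs(path)
    if not configs:
        return DEFAULT_CONFIG
    nearest = min(configs, key=lambda m: abs(m - num_tokens))
    return configs[nearest]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lora_shrink_demo.json")  # hypothetical file name
    with open(path, "w") as f:
        json.dump({"1": {"BLOCK_M": 16, "num_warps": 2},
                   "64": {"BLOCK_M": 64, "num_warps": 8}}, f)
    small = pick_config(path, 2)    # closest stored entry is 1
    large = pick_config(path, 48)   # closest stored entry is 64
    fallback = pick_config(os.path.join(tmp, "missing.json"), 48)
```

Supporting a new GPU or Triton version would then amount to adding another JSON file, while a missing file silently falls back to the static config.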

I ran a tuned-kernels vs. untuned-kernels microbenchmark:
https://docs.google.com/spreadsheets/d/1f87TKWuwJXVK2-8YHGBoVbo6ueEWGpLPPUIkgYs6dMo/edit?usp=sharing
TL;DR: the tuned shrink kernels are better than the untuned versions in most cases. For the expand kernels, the tuned version is much better in low-batch-size regimes.

E2E performance:

The 200 tokens/s number I shared was measured on an A100 GPU. The E2E gain is limited in this case, but I believe that with CUDA Graphs enabled, the tuned kernels will have a bigger impact.

Also, I haven't checked the E2E performance on an H100 GPU. It might be better.

What do you think ?

Spoke with @jeejeelee IRL - The plan is to remove the triton configs from this PR and introduce them in a separate PR so we can reason about them separately.

@jeejeelee I have removed the configs from this PR. Can you take another pass at the PR when you find the time, please? Thanks 🙏

varun-sundar-rabindranath · 2025-02-20T15:32:31Z

vllm/lora/punica_wrapper/punica_gpu.py

+            self._v1_prepare_metadata_tensors(self.token_lora_indices,
+                                              self.sampler_indices)
+        else:
+            # Forward to base class update_metadata


Is there a better way to call the base class method? (Note that this class inherits from multiple classes.)

Maybe super().update_metadata is better
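The `super()` suggestion can be sketched as follows. The class names and method bodies are hypothetical stand-ins for the real wrapper classes. The point is only that `super()` follows the method resolution order (MRO), so it finds the base implementation without naming a parent class, even with multiple inheritance.

```python
class PunicaBaseSketch:
    """Stand-in for the base wrapper that owns the default update_metadata."""

    def update_metadata(self, mapping):
        self.calls.append(("base", mapping))


class V1KernelMixinSketch:
    """Stand-in for a second parent that supplies the V1-only helper."""

    def _v1_prepare_metadata_tensors(self, mapping):
        self.calls.append(("v1", mapping))


class PunicaGPUSketch(PunicaBaseSketch, V1KernelMixinSketch):
    def __init__(self, use_v1):
        self.use_v1 = use_v1
        self.calls = []

    def update_metadata(self, mapping):
        if self.use_v1:
            self._v1_prepare_metadata_tensors(mapping)
        else:
            # Forward to the next update_metadata in the MRO.
            super().update_metadata(mapping)


v0 = PunicaGPUSketch(use_v1=False)
v0.update_metadata("mapping")
v1 = PunicaGPUSketch(use_v1=True)
v1.update_metadata("mapping")
mro_names = [cls.__name__ for cls in PunicaGPUSketch.__mro__]
```

Compared with calling `PunicaBaseSketch.update_metadata(self, ...)` explicitly, this keeps working if the parent order or names change.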

jeejeelee · 2025-02-28T12:37:28Z

All the LoRA tests have failed again

varun-sundar-rabindranath · 2025-02-28T12:41:11Z

All the LoRA tests have failed again

Looking into this now 👍

varun-sundar-rabindranath · 2025-03-03T13:18:21Z

Update: I enabled the tests in tests/lora/test_layers.py for V1. The tests pass locally but run out of memory (OOM) on the CI. I am tracking this down.

mergify · 2025-03-03T13:57:27Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

jeejeelee · 2025-03-04T05:18:40Z

It seems these modifications have significantly increased the time consumption for lora testing

tests/lora/test_punica_ops.py

varun-sundar-rabindranath · 2025-03-04T13:08:12Z

It seems these modifications have significantly increased the time consumption for lora testing

Yes. This PR adds the v1_kernel tests in test_punica_ops.py and enables test_layers.py to run for V1 as well. I believe most of the increase comes from test_layers.py, which now runs for both V1 and V0 (effectively doubling its run time). I'll see what we can do here.

[Edit]
@jeejeelee

Update : Reduced the tests in commits a18d273 and ba94947
The times are now,

A maximum increase of 7 minutes. Do you think we should prune further?

vllm/lora/punica_wrapper/punica_gpu.py

jeejeelee

Thank you very much. Let's continue moving forward in the direction we discussed on Slack.

Signed-off-by: Varun Sundar Rabindranath <[email protected]>

varun-sundar-rabindranath requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners February 11, 2025 16:03

varun-sundar-rabindranath marked this pull request as draft February 11, 2025 16:04

mergify bot added the v1 label Feb 11, 2025

varun-sundar-rabindranath mentioned this pull request Feb 11, 2025

[Kernel] LoRA - Refactor sgmv kernels #13110

Merged

varun-sundar-rabindranath force-pushed the varun/v1-lora-kernels branch 2 times, most recently from 1e6caf2 to d78cd57 Compare February 20, 2025 13:02

varun-sundar-rabindranath marked this pull request as ready for review February 20, 2025 14:15

varun-sundar-rabindranath mentioned this pull request Feb 20, 2025

[Do Not Merge] - LoRA V1 Reference PR #11613

Closed

varun-sundar-rabindranath force-pushed the varun/v1-lora-kernels branch from 1542d94 to 5d2caca Compare February 28, 2025 08:16

varun-sundar-rabindranath force-pushed the varun/v1-lora-kernels branch from 1da960e to af77fb1 Compare March 1, 2025 05:19

mergify bot added the needs-rebase label Mar 3, 2025

varun-sundar-rabindranath force-pushed the varun/v1-lora-kernels branch 2 times, most recently from 6b9fadf to 2e9eb8b Compare March 3, 2025 21:44

mergify bot removed the needs-rebase label Mar 3, 2025

jeejeelee reviewed Mar 4, 2025

jeejeelee reviewed Mar 5, 2025

jeejeelee approved these changes Mar 10, 2025

jeejeelee added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 10, 2025

Varun Sundar Rabindranath added 20 commits March 10, 2025 11:06 (each Signed-off-by: Varun Sundar Rabindranath <[email protected]>)

6922801 Add v1 kernels
9dc277d fix isort
3cd67c4 fix tests
491d5f2 fix comments
22571f2 fix punica base
9cc1031 fix comment
119d241 fix base class call
76bc6f4 avoid updates to punica base
beb8f08 fix tests_layers
ccfc6da fix test_layers.py
103b1c3 fix expand kernel
b245820 fixes
c73c780 fix oom
166ed7f nit
00964ec nits
71bb12e lru_cache nits
2041ff7 merge punica ops tests
567e33f fix test_layers test times
1a68338 remove configs
a066402 fixes

varun-sundar-rabindranath force-pushed the varun/v1-lora-kernels branch from c6bffe1 to a066402 Compare March 10, 2025 15:07

robertgshaw2-redhat merged commit 5ff0d32 into vllm-project:main Mar 10, 2025
36 checks passed

chenhongyu2048 mentioned this pull request Mar 20, 2025

[Bug]: Capture CudaGraph with LoRA #15090

Closed

ckhordiasma mentioned this pull request Apr 17, 2025

[do not merge] pr test for nm changes into 2.20 red-hat-data-services/vllm#107

Closed

shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025

[V1] LoRA - Add triton kernels for V1 (vllm-project#13096)

4ace1be

Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]>
