[V1] LoRA - Add triton kernels for V1 #13096
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 1e6caf2 to d78cd57.
Review thread on vllm/lora/ops/triton_ops/v1/utils.py (outdated):
@@ -0,0 +1,102 @@
# SPDX-License-Identifier: Apache-2.0
Requesting reviews on this file. The utilities in this file deal with loading the stored triton configs.
cc @tlrmchlsmth @mgoin
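For reviewers unfamiliar with the pattern, here is a minimal sketch of the kind of lookup these utilities perform, mirroring the fused_moe config files. The file layout, helper names, and nearest-batch-size matching below are illustrative assumptions, not the exact code in this file:

```python
import json
import os
from functools import lru_cache
from typing import Optional


@lru_cache
def _load_config_file(path: str) -> Optional[dict]:
    # Assumption: configs are stored as JSON keyed by batch size,
    # e.g. {"16": {"BLOCK_M": 32, ...}, "64": {...}}.
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def get_lora_op_config(path: str, num_tokens: int) -> Optional[dict]:
    """Return the tuned kernel config whose batch-size key is closest to
    num_tokens, or None so the caller falls back to a default config."""
    configs = _load_config_file(path)
    if configs is None:
        return None
    closest_key = min(configs.keys(), key=lambda k: abs(int(k) - num_tokens))
    return configs[closest_key]
```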
QQ: Is there a significant gain from storing tuned configs?
Not significant - but I do see some gains. On average I see about 200 tokens/s more throughput when using the tuned configs.
TBH, storing these configs is a bit crazy; I'm not sure if this is the right direction.
It appears to be common to have these configs for Triton kernels; the fused_moe kernels also have such configs here: https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/fused_moe/configs
Arguments for storing configs:
- The default/static config isn't going to be optimal for all cases.
- Generating these configs is easy. With new GPUs and Triton version updates we can simply run the Triton autotuner and drop in the new configs without any code change (a rough sketch of that workflow follows this list).
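Roughly, the offline workflow looks like this. The kernel below is a placeholder copy kernel, not the LoRA shrink/expand kernels from this PR, and reading `Autotuner.best_config` is a version-dependent assumption about the Triton runtime:

```python
import json

import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
    ],
    key=["n_elements"],
)
@triton.jit
def _copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)


def tune_and_dump(n_elements: int, out_path: str):
    # Run the autotuned kernel once so the autotuner benchmarks the configs,
    # then serialize the winning config keyed by the problem size.
    src = torch.randn(n_elements, device="cuda")
    dst = torch.empty_like(src)
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK"]),)
    _copy_kernel[grid](src, dst, n_elements)
    best = _copy_kernel.best_config  # assumption: available on this Triton version
    with open(out_path, "w") as f:
        json.dump({str(n_elements): best.kwargs | {"num_warps": best.num_warps}}, f)
```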
I did a tuned-kernels vs. untuned-kernels micro-benchmark run:
https://docs.google.com/spreadsheets/d/1f87TKWuwJXVK2-8YHGBoVbo6ueEWGpLPPUIkgYs6dMo/edit?usp=sharing
TL;DR: the tuned shrink kernels are better than the untuned versions in most cases. For the expand kernels, the tuned version is much better in low-batch-size regimes.
E2E performance:
- The 200 tokens/s number I shared is on an A100 GPU. The E2E gain is limited in this case, but I believe that with CUDA Graphs enabled the tuned kernels will have a bigger impact.
- Also, I haven't checked E2E performance on an H100 GPU; it might be better there.
What do you think?
Spoke with @jeejeelee IRL - The plan is to remove the triton configs from this PR and introduce them in a separate PR so we can reason about them separately.
@jeejeelee I have removed the configs from this PR. Can you take another pass at the PR when you find the time, please? Thanks 🙏
    self._v1_prepare_metadata_tensors(self.token_lora_indices,
                                      self.sampler_indices)
else:
    # Forward to base class update_metadata
Is there a better way to call the base class method? (Note that this class inherits from multiple classes.)
Maybe super().update_metadata is better.
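A minimal sketch of why the cooperative super() call works even with multiple base classes; the class names here are illustrative, not the actual vLLM hierarchy:

```python
class BaseMetadata:
    def update_metadata(self, **kwargs):
        print("BaseMetadata.update_metadata")


class PunicaWrapperBase(BaseMetadata):
    def update_metadata(self, **kwargs):
        print("PunicaWrapperBase.update_metadata")
        super().update_metadata(**kwargs)


class V1KernelMixin:
    # Mixin with no update_metadata of its own.
    pass


class PunicaWrapperV1(V1KernelMixin, PunicaWrapperBase):
    def update_metadata(self, use_v1_path: bool = True, **kwargs):
        if use_v1_path:
            print("V1-specific metadata preparation")
        else:
            # super() walks PunicaWrapperV1.__mro__, skipping V1KernelMixin
            # (which has no update_metadata) and landing on PunicaWrapperBase,
            # so there is no need to name the base class explicitly.
            super().update_metadata(**kwargs)


PunicaWrapperV1().update_metadata(use_v1_path=False)
# prints:
#   PunicaWrapperBase.update_metadata
#   BaseMetadata.update_metadata
```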
Force-pushed from 1542d94 to 5d2caca.
All the LoRA tests have failed again.
Looking into this now 👍
Force-pushed from 1da960e to af77fb1.
Update: I enabled tests in
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 6b9fadf to 2e9eb8b.
Yes. This PR adds the v1_kernel tests in test_punica_ops.py and enables them.
[Edit] Update: I reduced the tests in commits a18d273 and ba94947; the increase is at most 7 minutes. Do you think we should prune further?
Thank you very much. Let's continue moving forward in the direction we discussed on Slack.
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Force-pushed from c6bffe1 to a066402.
Add shrink and expand triton kernels for V1.
Why do we need a new set of kernels:
Kernel Code Change:
The new kernels re-use a lot of the code from the existing SGMV kernels. The main changes are:
All other kernel code is the same as the existing SGMV kernels. I refactored the code so it can be reused.
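For readers unfamiliar with the two ops, here is a plain PyTorch reference of what shrink and expand compute. This is an illustration, not the Triton kernels in this PR; the shapes, argument names, and scaling are assumptions:

```python
import torch


def lora_shrink_ref(x: torch.Tensor, lora_a: torch.Tensor,
                    token_lora_ids: torch.Tensor, scale: float) -> torch.Tensor:
    # x: [num_tokens, hidden], lora_a: [num_loras, hidden, rank]
    # token_lora_ids: [num_tokens] index of the adapter assigned to each token.
    # Each token is projected down to the small LoRA rank with its adapter's A matrix.
    return scale * torch.einsum("th,thr->tr", x, lora_a[token_lora_ids])


def lora_expand_ref(y: torch.Tensor, lora_b: torch.Tensor,
                    token_lora_ids: torch.Tensor,
                    out: torch.Tensor) -> torch.Tensor:
    # y: [num_tokens, rank], lora_b: [num_loras, rank, out_dim]
    # The low-rank activations are projected back up and accumulated into the
    # base model output.
    out += torch.einsum("tr,tro->to", y, lora_b[token_lora_ids])
    return out
```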
Benchmark serving numbers:
Server command:
VLLM_USE_V1="1" vllm serve meta-llama/Llama-2-7b-hf --max-loras 4 --max-lora-rank 8 --enable-lora --lora-modules lora1=yard1/llama-2-7b-sql-lora-test lora2=yard1/llama-2-7b-sql-lora-test lora3=yard1/llama-2-7b-sql-lora-test lora4=yard1/llama-2-7b-sql-lora-test --no-enable-prefix-caching
Benchmark command:
python3 benchmarks/benchmark_serving.py --model meta-llama/Llama-2-7b-hf --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 500 --request-rate inf --lora-modules lora1 lora2 lora3 lora4
V1 LoRA - This PR:
V1 LoRA - Main:
Kernel micro benchmark:
Please find the kernel micro-benchmark here: https://docs.google.com/spreadsheets/d/1b_8KsDGdiSGWlHODMszug_-do7OlSPlSzoV84VkmVPc/edit?usp=sharing (sheet: "V1 : Dont Sort Tokens By LoRA ")
Note: The V0 SGMV and BGMV kernels are not tuned, while the V1 kernels are tuned with the Triton auto-tuner, so the discrepancy between the V1 and SGMV/BGMV kernels could be partially explained by the tuning.
The SGMV kernel depends heavily on the input being sorted. V1 kernels aren't affected as much.
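To make the sorting point concrete, here is a small illustration of the contiguous per-LoRA segments an SGMV-style kernel expects, versus the per-token index a kernel can gather from directly without sorting. This is an assumption about the general SGMV preprocessing shape, not code from this PR:

```python
import torch

# Per-token adapter assignment for a small batch of 6 tokens.
token_lora_ids = torch.tensor([2, 0, 2, 1, 0, 2])

# SGMV-style preprocessing: sort tokens so each LoRA owns one contiguous
# segment, then describe the segments with (start, length) pairs.
sorted_ids, order = torch.sort(token_lora_ids)
# `order` would be used to gather token activations into the sorted layout.
unique_ids, seg_lens = torch.unique_consecutive(sorted_ids, return_counts=True)
seg_starts = torch.cumsum(seg_lens, dim=0) - seg_lens
print(unique_ids.tolist(), seg_starts.tolist(), seg_lens.tolist())
# [0, 1, 2] [0, 2, 3] [2, 1, 3]

# Per-token indexing: a kernel reads token_lora_ids[i] directly for token i,
# so the original token order can be kept and no sort is required.
```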