[Doc] Add max_lora_rank configuration guide #22782
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Helps users avoid performance loss from memory over-allocation. Signed-off-by: chiliu <[email protected]>
Code Review
This pull request adds a crucial warning for a common LoRA misconfiguration where `max_lora_rank` is set higher than the adapter's actual rank, which can lead to significant performance degradation and memory over-allocation. The implementation is clean and placed correctly within the adapter loading logic. The warning message is informative and provides a clear, actionable solution to the user. This is an excellent diagnostic improvement that will help users avoid silent performance issues.
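For context, the runtime check described above (later dropped in favor of a documentation note) could look roughly like the sketch below. This is illustrative only and not vLLM's actual adapter-loading code; the function and argument names are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


# Hypothetical helper: compare the configured max_lora_rank with an adapter's
# actual rank and warn about over-allocation (sketch, not vLLM's real code).
def warn_if_rank_overallocated(adapter_name: str, adapter_rank: int,
                               max_lora_rank: int) -> None:
    if max_lora_rank > adapter_rank:
        logger.warning(
            "LoRA adapter '%s' has rank %d but max_lora_rank is %d. "
            "LoRA buffers are sized by max_lora_rank, so consider lowering it "
            "to the largest rank you actually serve.",
            adapter_name, adapter_rank, max_lora_rank,
        )
```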
Force-pushed the branch from 8119439 to bfa3a52 (compare).
We support multi-LoRA, where each LoRA has a different rank - it could be 16, or it could be 256. The …
@jeejeelee You're right - I didn't think through the multi-LoRA dynamic loading case properly. When serving multiple adapters … Instead of runtime warnings, would it make sense to add a brief doc note? Something like:
What do you think?
Nice, we can add the related notes at the end of the LoRA doc.
Thanks for the feedback! I've removed the code changes and only kept the documentation addition as you suggested.
docs/features/lora.md (outdated review thread)
Suggested change:
- ## Configuring `max_lora_rank`
+ ## Using Tips
+ ### Configuring `max_lora_rank`
updated
docs/features/lora.md (outdated review thread)
Suggested change:
- - **Example**: If your LoRA adapters have ranks [16, 32, 64], use `--max-lora-rank 64` rather than 256
+ For example, if your LoRA adapters have ranks [16, 32, 64], use `--max-lora-rank 64` rather than 256
updated
Signed-off-by: chiliu <[email protected]>
Force-pushed the branch from 597ee35 to b6159ea (compare).
LGTM, thank you for the contribution.
Signed-off-by: chiliu <[email protected]> Signed-off-by: Diego-Castan <[email protected]>
Signed-off-by: chiliu <[email protected]>
Signed-off-by: chiliu <[email protected]>
Signed-off-by: chiliu <[email protected]>
Signed-off-by: chiliu <[email protected]> Signed-off-by: Xiao Yu <[email protected]>
Signed-off-by: chiliu <[email protected]>
Purpose

Add documentation to help users properly configure the `max_lora_rank` parameter to avoid performance issues.

Context

As discussed with @jeejeelee, in multi-LoRA serving scenarios, `max_lora_rank` must be set to accommodate the largest rank among all LoRA adapters that might be loaded. However, setting it unnecessarily high wastes memory and can cause performance issues.
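As an illustration of that guidance (not part of this PR's diff), a multi-LoRA setup with the offline `LLM` API might pin `max_lora_rank` to the largest rank actually served; the base model name and adapter path below are placeholders.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Suppose the adapters being served have ranks 16, 32, and 64: set
# max_lora_rank to the largest of them (64), not an arbitrary ceiling like 256.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder base model
    enable_lora=True,
    max_loras=3,
    max_lora_rank=64,
)

outputs = llm.generate(
    ["Write a SQL query that counts users per country."],
    SamplingParams(temperature=0.0, max_tokens=64),
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/rank64_adapter"),  # placeholder path
)
print(outputs[0].outputs[0].text)
```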
Changes

Added a new section "Configuring `max_lora_rank`" at the end of docs/features/lora.md that explains:

Case: `max_lora_rank` is set much higher than the adapter's actual rank, which is 16.

Why this causes performance degradation:
- LoRA buffers are allocated based on `max_lora_rank` for all LoRA operations
- When `max_lora_rank > actual rank`, the system allocates more tensors than needed

Performance comparison (production workload, A100-80GB):
- `max_lora_rank=256`: 3.649s per request
- `max_lora_rank=16`: 2.292s per request

Memory usage:
- `max_lora_rank=256`: 3.18 GB GPU memory for LoRA
- `max_lora_rank=16`: 0.20 GB GPU memory for LoRA

(Optional) Documentation Update

Not required - this is a diagnostic improvement only.
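On a related note, an adapter's actual rank can typically be read from the `r` field of its PEFT `adapter_config.json`, which is a quick way to pick an appropriate `--max-lora-rank` value; the directory paths below are placeholders.

```python
import json
from pathlib import Path


def adapter_rank(adapter_dir: str) -> int:
    """Return the LoRA rank ('r') recorded in a PEFT adapter's adapter_config.json."""
    config = json.loads(Path(adapter_dir, "adapter_config.json").read_text())
    return config["r"]


# Placeholder adapter directories; use the maximum rank as --max-lora-rank.
ranks = [adapter_rank(d) for d in ("/adapters/a", "/adapters/b", "/adapters/c")]
print("Set --max-lora-rank to", max(ranks))
```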