[Core] Split LoRA layers #24574
Conversation
Signed-off-by: Jee Jee Li <[email protected]>
Code Review
This pull request refactors the LoRA layers by splitting a large layers.py file into a more organized package structure. This is a great improvement for maintainability. I've found a critical issue that could break support for Marlin quantized models, and a high-severity typo in a new module's name that should be corrected.
def _get_lora_device(base_layer: nn.Module) -> torch.device:
    # code borrowed from https://github.com/fmmoret/vllm/blob/fm-support-lora-on-quantized-models/vllm/lora/layers.py#L34
    """Returns the device for where to place the LoRA tensors."""
    # unquantizedLinear
    if hasattr(base_layer, "weight"):
        return base_layer.weight.device
    # Compressed Tensor
    elif hasattr(base_layer, "weight_packed"):
        return base_layer.weight_packed.device
    # GPTQ/AWQ
    elif hasattr(base_layer, "qweight"):
        return base_layer.qweight.device
    # HQQ marlin
    elif hasattr(base_layer, "W_q"):
        return base_layer.W_q.device
    else:
        raise ValueError(f"Unsupported base layer: {base_layer}")
The _get_lora_device function is missing a check for the B attribute, which is used by Marlin quantized models. This can cause a ValueError when using LoRA with Marlin quantization, as the function would fail to determine the correct device. The weight property in BaseLinearLayerWithLoRA correctly handles this case, and this function should be consistent with it to avoid a regression.
Suggested change:

def _get_lora_device(base_layer: nn.Module) -> torch.device:
    # code borrowed from https://github.com/fmmoret/vllm/blob/fm-support-lora-on-quantized-models/vllm/lora/layers.py#L34
    """Returns the device for where to place the LoRA tensors."""
    # unquantizedLinear
    if hasattr(base_layer, "weight"):
        return base_layer.weight.device
    # Compressed Tensor
    elif hasattr(base_layer, "weight_packed"):
        return base_layer.weight_packed.device
    # GPTQ/AWQ
    elif hasattr(base_layer, "qweight"):
        return base_layer.qweight.device
    # marlin
    elif hasattr(base_layer, "B"):
        return base_layer.B.device
    # HQQ marlin
    elif hasattr(base_layer, "W_q"):
        return base_layer.W_q.device
    else:
        raise ValueError(f"Unsupported base layer: {base_layer}")
vllm/lora/layers/__init__.py
from vllm.lora.layers.vocal_parallel_embedding import (
    VocabParallelEmbeddingWithLoRA,
)
There appears to be a typo in the module name vocal_parallel_embedding. It should likely be vocab_parallel_embedding to be consistent with the class name VocabParallelEmbeddingWithLoRA and the general terminology. This should be corrected in the filename (vllm/lora/layers/vocal_parallel_embedding.py) and here in the import statement for better code clarity and maintainability.
Suggested change:

from vllm.lora.layers.vocab_parallel_embedding import (
    VocabParallelEmbeddingWithLoRA,
)
Should we also break this file's implementations into column_parallel_linear.py, row_parallel_linear.py, etc., so that each subclass stays in the same file as its parent class?
Makes sense
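For illustration, a hedged sketch of the grouping being discussed, with each subclass kept next to its parent. Only BaseLinearLayerWithLoRA, ColumnParallelLinearWithLoRA, RowParallelLinearWithLoRA, and the two proposed module names come from this thread; the base module name and the extra merged subclass are illustrative stubs, not the actual vLLM layout:

import torch.nn as nn


# base.py (assumed module name): shared LoRA plumbing for linear layers.
class BaseLinearLayerWithLoRA(nn.Module):
    """Stub standing in for the real base class."""


# column_parallel_linear.py: the parent and its merged-style variants together.
class ColumnParallelLinearWithLoRA(BaseLinearLayerWithLoRA):
    pass


class MergedColumnParallelLinearWithLoRA(ColumnParallelLinearWithLoRA):
    pass


# row_parallel_linear.py: row-parallel variants kept alongside their parent.
class RowParallelLinearWithLoRA(BaseLinearLayerWithLoRA):
    pass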
Signed-off-by: Jee Jee Li <[email protected]>
Force-pushed from 1792464 to 52f3e3a
Otherwise LGTM!
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Purpose
Split LoRA layers into separate files to facilitate maintenance and improve readability.
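As a rough sketch of what the resulting package entry point might look like, the new vllm/lora/layers/__init__.py would re-export the split classes so that existing `from vllm.lora.layers import ...` call sites keep working. Only the vocab_parallel_embedding module and the class names already mentioned in this conversation are taken from the PR; the other module names are assumptions:

# vllm/lora/layers/__init__.py (illustrative sketch, not the merged file)
from vllm.lora.layers.column_parallel_linear import (  # assumed module name
    ColumnParallelLinearWithLoRA,
)
from vllm.lora.layers.row_parallel_linear import (  # assumed module name
    RowParallelLinearWithLoRA,
)
from vllm.lora.layers.vocab_parallel_embedding import (
    VocabParallelEmbeddingWithLoRA,
)

__all__ = [
    "ColumnParallelLinearWithLoRA",
    "RowParallelLinearWithLoRA",
    "VocabParallelEmbeddingWithLoRA",
]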
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
(Optional) Documentation update, including supported_models.md and examples for a new model.