[V1] [Hybrid] Support Minimax-Text-01 in V1 #22151
Conversation
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run a small and essential subset of CI tests to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Code Review
This pull request introduces support for the Minimax-Text-01 model in V1, which is a significant step for expanding hybrid model compatibility. The changes correctly adapt the model to the V1 architecture, including updates to the attention mechanisms and KV cache handling. The test suite has also been updated to include this new model, which is great to see.
My review has identified a few areas for improvement, primarily concerning leftover debugging statements that should be removed before merging. Additionally, there's a newly added utility method with an incorrect implementation and documentation that needs to be addressed to ensure code quality and maintainability. Addressing these points will help solidify the changes and maintain the high quality of the codebase.
I'm very interested in the tiny model. Would you be able to share it?
@qscqesze sure, I'm using this one: https://huggingface.co/Goekdeniz-Guelmez/MiniMax01Text-Dev
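(Not a command from this thread, just a hedged sketch for anyone who wants to poke at the branch locally: a minimal offline smoke test against that tiny checkpoint, using VLLM_USE_V1=1 to select the V1 engine and enforce_eager to match the PR's current eager-only support.)

```python
# Illustrative smoke test (not part of the PR): load the tiny dev checkpoint
# linked above on the V1 engine and generate a few tokens.
import os
os.environ["VLLM_USE_V1"] = "1"   # opt in to the V1 engine before importing vllm

from vllm import LLM, SamplingParams

llm = LLM(
    model="Goekdeniz-Guelmez/MiniMax01Text-Dev",
    trust_remote_code=True,   # the MiniMax-Text-01 architecture ships custom modeling code
    enforce_eager=True,       # only eager mode is supported by this PR
)
out = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=16),
)
print(out[0].outputs[0].text)
```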
I think you forgot to paste the gsm8k V1 results in the description, or I am just getting ahead of myself since it's still a draft, sorry in that case :)
@NickLucche yeah haha, I didn't paste them because they don't look good yet lol. Still debugging it
It seems to be specifically an issue when the blocks start to get recycled. I see good lm_eval results until blocks start running out and getting re-used. To me this suggests some issue with the way the state is being reset.
@NickLucche Fixed the problem and updated above :)
BTW the problem was related to trying to use fp32 for the linear attention state, but fp16 for the normal attention state. This is what happens on V0 right now, but didn't seem to work when I tried it in V1. Not sure if this is some inherent limitation, will think more on that.
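For intuition on why that mismatch can bite, here is a rough, purely illustrative sketch (not vLLM's actual KV cache spec classes; the class names, shapes, and block size are invented): with a paged cache that expects one uniform page layout, keeping the linear-attention state in fp32 while the softmax-attention KV cache stays in fp16 changes the per-block byte footprint the allocator has to reconcile.

```python
# Illustrative only: back-of-the-envelope page-size accounting for a hybrid
# model. Shapes, block size, and class names are made-up assumptions, not the
# MiniMax-Text-01 config or vLLM's real implementation.
from dataclasses import dataclass

DTYPE_BYTES = {"float16": 2, "bfloat16": 2, "float32": 4}

@dataclass
class FullAttentionSpecSketch:
    block_size: int = 16      # tokens per KV block
    num_kv_heads: int = 8
    head_dim: int = 128
    dtype: str = "float16"

    def page_size_bytes(self) -> int:
        # K and V for block_size tokens
        return 2 * self.block_size * self.num_kv_heads * self.head_dim * DTYPE_BYTES[self.dtype]

@dataclass
class LinearAttentionSpecSketch:
    num_heads: int = 8
    head_dim: int = 128
    dtype: str = "float32"

    def page_size_bytes(self) -> int:
        # fixed-size recurrent state per head, independent of sequence length
        return self.num_heads * self.head_dim * self.head_dim * DTYPE_BYTES[self.dtype]

attn = FullAttentionSpecSketch(dtype="float16")
linear = LinearAttentionSpecSketch(dtype="float32")
print(attn.page_size_bytes(), linear.page_size_bytes())
# prints 65536 524288: an allocator that assumes a single dtype/page layout
# needs extra padding or special-casing once the linear state is widened to fp32.
```

(Whether this is a hard limitation in V1 or just something that needs explicit handling is exactly the open question raised above.)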
I have tested your pull request: the accuracy on GSM8k is 0.920, and the average accuracy on MMLU is 0.846.

Deployment command:
python3 -m vllm.entrypoints.api_server --model /data/xxx/model/MiniMax-Text-01/ --tensor-parallel-size 8 --trust-remote-code --quantization experts_int8 --max_model_len 4096 --dtype bfloat16 --no-enable-prefix-caching --enforce-eager

I used the sglang benchmark scripts. The detailed results are as follows.

GSM8k test:
python3 bench_other.py --num-questions 500 --num-shots 5 --backend vllm --port 8000 --host http://127.0.0.1
# ...
Accuracy: 0.920
Invalid: 0.000
Latency: 155.861 s

MMLU test:
python3 bench_other.py --nsub 200 --backend vllm --port 8000 --host http://127.0.0.1
# ...
Total latency: 1511.974
Average accuracy: 0.846
lgtm, but let's wait for someone more familiar with the model to chime in.
It's a bit of a pity we can't pair this with a v0 deprecation task and just get rid of the v0 branching.
@NickLucche Re: V0 code, I agree in general. There will still be a performance gap to V0 until we merge #21401, so I'm not sure we should remove V0 code until performance is at least matching. I think we may also want to wait until hybrid models are supported using FlashAttention (e.g., after #21549 is merged).
This pull request has merge conflicts that must be resolved before it can be merged.
cc @heheda12345
I think this code is fine. Great job! Looking forward to its early merge.
LGTM! But can you move all ssm-related attention backends to v1/attention/backends/mamba? Either in this PR or a follow-up PR is OK for me.
And let's do more refactoring in the future to make the integration more pluggable.
And thanks to the MiniMax team for verifying the correctness of this PR.
Thanks, can you open a follow-up PR to update the docs accordingly?
Thank you for the elegant design and meticulous engineering in this work; your code has already become the go-to reference for many of us. I'm currently aligning our in-house, multi-branch hybrid model for an upcoming sync back to main. We also maintain a dual-precision linear cache (some tensors stay in bf16 while others stay in fp32), so your implementation will be an invaluable blueprint for how to handle the mixed-precision paths cleanly. I believe the limitation exists because this code only considers one type?
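As a purely hypothetical sketch of the "more than one dtype" case being asked about (names, shapes, and structure below are invented for illustration and are not this PR's code): a per-layer state spec could record a dtype for each state tensor and sum per-tensor footprints, instead of deriving the cache layout from a single dtype.

```python
# Hypothetical illustration, not vLLM code: a state spec that tracks one dtype
# per state tensor so bf16 and fp32 tensors can coexist in the same layer.
from dataclasses import dataclass, field

DTYPE_BYTES = {"bfloat16": 2, "float16": 2, "float32": 4}

@dataclass
class StateTensorSpec:
    name: str
    shape: tuple
    dtype: str

    def nbytes(self) -> int:
        numel = 1
        for dim in self.shape:
            numel *= dim
        return numel * DTYPE_BYTES[self.dtype]

@dataclass
class MixedPrecisionStateSpec:
    tensors: list = field(default_factory=list)

    def page_size_bytes(self) -> int:
        # sum per-tensor footprints rather than assuming one dtype for the whole cache
        return sum(t.nbytes() for t in self.tensors)

spec = MixedPrecisionStateSpec(tensors=[
    StateTensorSpec("recurrent_state", (8, 128, 128), "float32"),  # kept in fp32
    StateTensorSpec("gate_state", (8, 128), "bfloat16"),           # kept in bf16
])
print(spec.page_size_bytes())  # 524288 + 2048 = 526336
```

Whether the current cache grouping can accept something like this, or whether everything has to be normalized to a single dtype first, is the question the comment above is raising.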
Essential Elements of an Effective PR Description Checklist
- (Optional) Documentation update, e.g. supported_models.md and examples for a new model

Purpose
This PR enables Minimax-Text-01 in vLLM V1. Currently only eager mode is supported.
I've tried to keep the changes minimal. I hope this PR could serve as a reference/template for how to enable other non-mamba2-based hybrid models in V1.
Test Plan
Deploy using V0 from main:
Deploy using V1 from this PR:
Evaluation:
Test Result
V0:
V1:
(Optional) Documentation Update