fix baichuan-7b tp #598
Conversation
Is this the same reason behind the baichuan-13b issue? #530
Yes. I have tested it on both baichuan-13b and 7b, and both produce normal output under TP (a minimal run sketch follows below).
Can I use this PR directly on 13B?
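For reference, here is a minimal sketch of the kind of tensor-parallel smoke test behind that verification. The model id, prompt, and TP degree are assumptions for illustration, not taken from this thread:

```python
# Minimal TP smoke-test sketch, assuming the Baichuan checkpoints are available
# on the Hugging Face Hub; not the exact commands used by the author.
from vllm import LLM, SamplingParams

llm = LLM(
    model="baichuan-inc/Baichuan-7B",  # swap in the 13B checkpoint to test that path
    tensor_parallel_size=2,            # requires 2 visible GPUs
    trust_remote_code=True,            # Baichuan ships custom modeling code
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)  # should be coherent text, not garbage, under TP
```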
Thank you for your contribution! Could you use our official formatting script and remove the other format-only changes?
Is this the only part that actually changes the code logic? Could you remove the other format-only modifications and use the format.sh script we provide to re-format the code? Thanks!
Hi, I have updated the PR and removed the format-only changes.
356793c to aeb2d9e
LGTM! Thank you for your contribution!
Co-authored-by: wq.chu <[email protected]>
…ct#598)

### What this PR does / why we need it?
DeepSeek V3 currently adopts vanilla chunked prefill in the MLA part, which is inefficient to compute but necessary for chunked prefill. Since PR vllm-project/vllm-ascend#543 brings the v0 scheduler into vllm-ascend, we can now use torch_npu._npu_flash_attention inside the MLA backend for a further performance boost. Some redundant computation inside the RoPE is also removed. This PR should bring a performance gain for DeepSeek eager-mode inference.

---------

Signed-off-by: ganyi <[email protected]>
The main modifications are in the "load_weights" function.
Before:

After:
