[Bugfix] Fix device ordinal when initializing spec_decode_sampler under multi-node setup #13269
…er multi-node setup Signed-off-by: Shangming Cai <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
For example, if tp is set to 16, the worker node will use "cuda:8" through "cuda:15" to initialize tensors, which raises a device ordinal error in the current implementation. I have verified that this PR fixes the bug on 2 nodes, each with NVIDIA H20 x 8. However, I'm not sure whether this is the most appropriate solution. I tried to change the device logic of CC list: @youkaichao @LiuXiaoxuanPKU
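To illustrate the ordinal problem described above, here is a minimal sketch of mapping a global tensor-parallel rank to a node-local device index. It assumes every node exposes the same number of GPUs and that ranks are assigned to nodes contiguously. The helper names `local_device_ordinal` and `device_string` are hypothetical and are not taken from the vLLM code base or from this PR's diff.

```python
def local_device_ordinal(global_rank: int, gpus_per_node: int) -> int:
    """Map a global rank to a device index that is valid on its own node.

    Assumption (not from the PR): ranks are laid out contiguously, so
    ranks 0..7 live on node 0, ranks 8..15 on node 1, and so on.
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    if global_rank < 0:
        raise ValueError("global_rank must be non-negative")
    # Each node only sees devices 0..gpus_per_node-1, so wrap the rank.
    return global_rank % gpus_per_node


def device_string(global_rank: int, gpus_per_node: int) -> str:
    """Build a device string such as "cuda:0" from a global rank."""
    return f"cuda:{local_device_ordinal(global_rank, gpus_per_node)}"


if __name__ == "__main__":
    # tp = 16 across 2 nodes with 8 GPUs each: show the second node's ranks.
    for rank in range(8, 16):
        print(rank, device_string(rank, gpus_per_node=8))
```

With tp set to 16 on two 8-GPU nodes, using the global rank directly would ask for "cuda:8" through "cuda:15" on the second node, while the wrapped index stays within "cuda:0" through "cuda:7".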
thanks for the fix!
FYI @LiuXiaoxuanPKU
@youkaichao @ShangmingCai, this seems to break the tests https://buildkite.com/vllm/ci/builds/13743#01951f87-e956-47dc-8c32-3faee9c97af1/6-12975 with error
Do you think the change in #13578 makes sense?
@simon-mo Thanks for the fix. It totally makes sense. Sorry for not considering the case when tp is not activated. Maybe we can optimize the logic of
…13269) Signed-off-by: Shangming Cai <[email protected]>
…13269) Signed-off-by: Shangming Cai <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>
This PR fixes #12841.