
Conversation

ShangmingCai
Contributor

This PR fixes #12841.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Shangming Cai <[email protected]>
@ShangmingCai
Contributor Author

SpecDecodeBaseSampler should use local_rank to init_tensors under distributed inference across multiple nodes, instead of rank.

For example, if tp is set to 16, the worker node will use "cuda:8" to "cuda:15" to init tensors and raise a device ordinal error in the current implementation. I have verified that this PR fixes the bug on 2 nodes with 8 NVIDIA H20 GPUs each, but I'm not sure if this is the most appropriate solution. I tried changing the device logic of init_tensors directly, but since the device is passed in from SpecDecodeWorker, I think the modification in this PR is better for readability.
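
To make the device-ordinal issue concrete, here is a minimal sketch of the idea behind the change. Only the use of get_tp_group().local_rank is taken from this PR (and from the traceback quoted later in this thread); the helper function and the rank attribute shown for contrast are illustrative assumptions.

```python
# Illustrative sketch (not the exact diff from this PR): why local_rank matters
# when tensor parallelism spans multiple nodes.
#
# With tp=16 across 2 nodes of 8 GPUs each, the global ranks on the second node
# are 8..15, but that node only exposes cuda:0..cuda:7. Indexing devices by the
# global rank therefore raises an invalid device ordinal error on the worker
# node; the node-local rank must be used instead.

import torch

from vllm.distributed.parallel_state import get_tp_group


def sampler_device() -> torch.device:
    tp_group = get_tp_group()
    # Buggy: device index derived from the global rank
    # (cuda:8..cuda:15 on the second node):
    #     torch.device(f"cuda:{tp_group.rank}")  # rank attribute assumed here
    # Fixed: device index derived from the node-local rank
    # (cuda:0..cuda:7 on every node):
    return torch.device(f"cuda:{tp_group.local_rank}")
```

In the PR itself this boils down to SpecDecodeWorker passing the TP group's local_rank, rather than the global rank, into spec_decode_sampler.init_tensors.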

CC list: @youkaichao @LiuXiaoxuanPKU

Member

@youkaichao youkaichao left a comment


thanks for the fix!

@youkaichao
Member

FYI @LiuXiaoxuanPKU

@youkaichao youkaichao merged commit 5ae9f26 into vllm-project:main Feb 19, 2025
18 checks passed
@simon-mo
Collaborator

@youkaichao @ShangmingCai, this seems to break the tests (https://buildkite.com/vllm/ci/builds/13743#01951f87-e956-47dc-8c32-3faee9c97af1/6-12975) with the following error:


[2025-02-19T20:40:07Z] /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py:744: AssertionError
[2025-02-19T20:40:07Z] _________________ test_init_device[typical_acceptance_sampler] _________________
[2025-02-19T20:40:07Z]
[2025-02-19T20:40:07Z] acceptance_sampler_method = 'typical_acceptance_sampler'
[2025-02-19T20:40:07Z]
[2025-02-19T20:40:07Z]     @pytest.mark.parametrize("acceptance_sampler_method",
[2025-02-19T20:40:07Z]                              ["rejection_sampler", "typical_acceptance_sampler"])
[2025-02-19T20:40:07Z]     @pytest.mark.skip_global_cleanup
[2025-02-19T20:40:07Z]     def test_init_device(acceptance_sampler_method: str):
[2025-02-19T20:40:07Z]         """Verify SpecDecodeWorker invokes proposer/scorer worker init_device, as
[2025-02-19T20:40:07Z]         well as other GPU initialization.
[2025-02-19T20:40:07Z]         """
[2025-02-19T20:40:07Z]         draft_worker = mock_worker(cls=MultiStepWorker, use_spec=False)
[2025-02-19T20:40:07Z]         target_worker = mock_worker(use_spec=False)
[2025-02-19T20:40:07Z]         spec_decode_sampler = mock_spec_decode_sampler(acceptance_sampler_method)
[2025-02-19T20:40:07Z]         metrics_collector = MagicMock(spec=AsyncMetricsCollector)
[2025-02-19T20:40:07Z]
[2025-02-19T20:40:07Z]         worker = SpecDecodeWorker(
[2025-02-19T20:40:07Z]             proposer_worker=draft_worker,
[2025-02-19T20:40:07Z]             scorer_worker=target_worker,
[2025-02-19T20:40:07Z]             spec_decode_sampler=spec_decode_sampler,
[2025-02-19T20:40:07Z]             disable_logprobs=False,
[2025-02-19T20:40:07Z]             metrics_collector=metrics_collector,
[2025-02-19T20:40:07Z]         )
[2025-02-19T20:40:07Z] >       worker.init_device()
[2025-02-19T20:40:07Z]
[2025-02-19T20:40:07Z] spec_decode/test_spec_decode_worker.py:597:
[2025-02-19T20:40:07Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-02-19T20:40:07Z] /usr/local/lib/python3.12/dist-packages/vllm/spec_decode/spec_decode_worker.py:369: in init_device
[2025-02-19T20:40:07Z]     self.spec_decode_sampler.init_tensors(get_tp_group().local_rank,
[2025-02-19T20:40:07Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-02-19T20:40:07Z]
[2025-02-19T20:40:07Z]     def get_tp_group() -> GroupCoordinator:
[2025-02-19T20:40:07Z] >       assert _TP is not None, ("tensor model parallel group is not initialized")
[2025-02-19T20:40:07Z] E       AssertionError: tensor model parallel group is not initialized
[2025-02-19T20:40:07Z]
[2025-02-19T20:40:07Z] /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py:744: AssertionError


Do you think the change in #13578 makes sense?

@ShangmingCai
Contributor Author

Do you think the change in #13578 makes sense?

@simon-mo Thanks for the fix. It totally makes sense. Sorry for not considering the case when TP is not activated. Maybe we can optimize the logic of spec_decode_sampler.init_device in the future to make the code cleaner, but I think these fixes are fine for now, considering that many people are trying multi-node inference.
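
For context, get_tp_group() asserts when the tensor parallel group has not been initialized, which is exactly what happens in the mocked unit test quoted above. A defensive sketch of the kind of fallback being discussed is shown below; it is purely illustrative and is not claimed to match the actual change in #13578.

```python
# Hypothetical sketch only; not the #13578 diff. Fall back to the rank the
# worker already knows about when the TP group is not initialized (e.g. in
# unit tests that mock the distributed setup).

from vllm.distributed.parallel_state import get_tp_group


def sampler_device_index(fallback_rank: int) -> int:
    """Return the device index to use for the spec-decode sampler tensors."""
    try:
        # Multi-node TP case: use the node-local rank (cuda:0..cuda:N-1).
        return get_tp_group().local_rank
    except AssertionError:
        # "tensor model parallel group is not initialized": single-process
        # or test setup, so keep the rank that was already passed in.
        return fallback_rank
```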

xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 20, 2025
Akshat-Tripathi pushed a commit to krai/vllm that referenced this pull request Mar 3, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025

Successfully merging this pull request may close these issues.

[Bug]: Speculative decoding reports errors when loading target model using distributed inference (VLLM's offical Ray setup)