
Conversation

ShangmingCai
Contributor

This PR fixes #12841.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Shangming Cai <[email protected]>
@ShangmingCai
Contributor Author

SpecDecodeBaseSampler should use local_rank to init_tensors under distributed inference across multiple nodes, instead of rank.

For example, if tp is set to 16, the worker node will use "cuda:8" to "cuda:15" to init tensors and raise a device ordinal error in the current implementation. I have verified that this PR fixes the bug on 2 nodes with 8 NVIDIA H20 GPUs each, but I'm not sure if this is the most appropriate solution. I tried changing the device logic of init_tensors directly, but since the device is passed in from SpecDecodeWorker, I think the modification in this PR is better for readability.
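
To make the device-ordinal issue concrete, here is a minimal sketch of the idea behind the change. Only the use of get_tp_group().local_rank is taken from this PR (and from the traceback quoted later in this thread); the helper function and the rank attribute shown for contrast are illustrative assumptions.

```python
# Illustrative sketch (not the exact diff from this PR): why local_rank matters
# when tensor parallelism spans multiple nodes.
#
# With tp=16 across 2 nodes of 8 GPUs each, the global ranks on the second node
# are 8..15, but that node only exposes cuda:0..cuda:7. Indexing devices by the
# global rank therefore raises an invalid device ordinal error on the worker
# node; the node-local rank must be used instead.

import torch

from vllm.distributed.parallel_state import get_tp_group


def sampler_device() -> torch.device:
    tp_group = get_tp_group()
    # Buggy: device index derived from the global rank
    # (cuda:8..cuda:15 on the second node):
    #     torch.device(f"cuda:{tp_group.rank}")  # rank attribute assumed here
    # Fixed: device index derived from the node-local rank
    # (cuda:0..cuda:7 on every node):
    return torch.device(f"cuda:{tp_group.local_rank}")
```

In the PR itself this boils down to SpecDecodeWorker passing the TP group's local_rank, rather than the global rank, into spec_decode_sampler.init_tensors.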

CC list: @youkaichao @LiuXiaoxuanPKU

Member

@youkaichao youkaichao left a comment


thanks for the fix!

@youkaichao
Member

FYI @LiuXiaoxuanPKU

@youkaichao youkaichao merged commit 5ae9f26 into vllm-project:main Feb 19, 2025
18 checks passed
@simon-mo
Collaborator

@youkaichao @ShangmingCai, this seems to break the tests (https://buildkite.com/vllm/ci/builds/13743#01951f87-e956-47dc-8c32-3faee9c97af1/6-12975) with the following error:


[2025-02-19T20:40:07Z] /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py:744: AssertionError
[2025-02-19T20:40:07Z] _________________ test_init_device[typical_acceptance_sampler] _________________
[2025-02-19T20:40:07Z]
[2025-02-19T20:40:07Z] acceptance_sampler_method = 'typical_acceptance_sampler'
[2025-02-19T20:40:07Z]
[2025-02-19T20:40:07Z]     @pytest.mark.parametrize("acceptance_sampler_method",
[2025-02-19T20:40:07Z]                              ["rejection_sampler", "typical_acceptance_sampler"])
[2025-02-19T20:40:07Z]     @pytest.mark.skip_global_cleanup
[2025-02-19T20:40:07Z]     def test_init_device(acceptance_sampler_method: str):
[2025-02-19T20:40:07Z]         """Verify SpecDecodeWorker invokes proposer/scorer worker init_device, as
[2025-02-19T20:40:07Z]         well as other GPU initialization.
[2025-02-19T20:40:07Z]         """
[2025-02-19T20:40:07Z]         draft_worker = mock_worker(cls=MultiStepWorker, use_spec=False)
[2025-02-19T20:40:07Z]         target_worker = mock_worker(use_spec=False)
[2025-02-19T20:40:07Z]         spec_decode_sampler = mock_spec_decode_sampler(acceptance_sampler_method)
[2025-02-19T20:40:07Z]         metrics_collector = MagicMock(spec=AsyncMetricsCollector)
[2025-02-19T20:40:07Z]
[2025-02-19T20:40:07Z]         worker = SpecDecodeWorker(
[2025-02-19T20:40:07Z]             proposer_worker=draft_worker,
[2025-02-19T20:40:07Z]             scorer_worker=target_worker,
[2025-02-19T20:40:07Z]             spec_decode_sampler=spec_decode_sampler,
[2025-02-19T20:40:07Z]             disable_logprobs=False,
[2025-02-19T20:40:07Z]             metrics_collector=metrics_collector,
[2025-02-19T20:40:07Z]         )
[2025-02-19T20:40:07Z] >       worker.init_device()
[2025-02-19T20:40:07Z]
[2025-02-19T20:40:07Z] spec_decode/test_spec_decode_worker.py:597:
[2025-02-19T20:40:07Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-02-19T20:40:07Z] /usr/local/lib/python3.12/dist-packages/vllm/spec_decode/spec_decode_worker.py:369: in init_device
[2025-02-19T20:40:07Z]     self.spec_decode_sampler.init_tensors(get_tp_group().local_rank,
[2025-02-19T20:40:07Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-02-19T20:40:07Z]
[2025-02-19T20:40:07Z]     def get_tp_group() -> GroupCoordinator:
[2025-02-19T20:40:07Z] >       assert _TP is not None, ("tensor model parallel group is not initialized")
[2025-02-19T20:40:07Z] E       AssertionError: tensor model parallel group is not initialized
[2025-02-19T20:40:07Z]
[2025-02-19T20:40:07Z] /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py:744: AssertionError


Do you think the change in #13578 makes sense?

@ShangmingCai
Contributor Author

Do you think the change in #13578 makes sense?

@simon-mo Thanks for the fix. It totally makes sense. Sorry for not considering the case when TP is not activated. Maybe we can optimize the logic of spec_decode_sampler.init_device in the future to make the code cleaner, but I think these fixes are fine for now, considering that many people are trying multi-node inference.
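
For context, get_tp_group() asserts when the tensor parallel group has not been initialized, which is exactly what happens in the mocked unit test quoted above. A defensive sketch of the kind of fallback being discussed is shown below; it is purely illustrative and is not claimed to match the actual change in #13578.

```python
# Hypothetical sketch only; not the #13578 diff. Fall back to the rank the
# worker already knows about when the TP group is not initialized (e.g. in
# unit tests that mock the distributed setup).

from vllm.distributed.parallel_state import get_tp_group


def sampler_device_index(fallback_rank: int) -> int:
    """Return the device index to use for the spec-decode sampler tensors."""
    try:
        # Multi-node TP case: use the node-local rank (cuda:0..cuda:N-1).
        return get_tp_group().local_rank
    except AssertionError:
        # "tensor model parallel group is not initialized": single-process
        # or test setup, so keep the rank that was already passed in.
        return fallback_rank
```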

xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 20, 2025
Akshat-Tripathi pushed a commit to krai/vllm that referenced this pull request Mar 3, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025

Successfully merging this pull request may close these issues.

[Bug]: Speculative decoding reports errors when loading target model using distributed inference (VLLM's offical Ray setup)