[v1] torchrun compatibility #13642
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of CI tests runs automatically to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
@@ -47,6 +47,9 @@ def test_consistent_across_ranks(obj):
        llm.llm_engine.vllm_config.cache_config.num_cpu_blocks)
    test_consistent_across_ranks(
        llm.llm_engine.vllm_config.cache_config.num_gpu_blocks)
    params = list(llm.llm_engine.model_executor.driver_worker.worker.model_runner.
                  model.parameters())
This is to test whether we can directly access the model with `llm.llm_engine.model_executor.driver_worker.worker.model_runner.model`. It is used in https://github.com/volcengine/verl/blob/0a1b16f800c25ac80504038fd8b8be4282d6c606/verl/workers/sharding_manager/fsdp_vllm.py#L84
Maybe worth a comment?
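For context, a minimal sketch of what this test exercises (the model name and the print at the end are illustrative assumptions, not the PR's exact test code):

```python
# Hedged sketch: direct access to the in-process model, which RLHF
# frameworks like verl rely on to sync trainer weights into the
# inference engine without re-initializing it.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enforce_eager=True)

# The attribute chain this test pins down:
model = llm.llm_engine.model_executor.driver_worker.worker.model_runner.model

# With a handle on the nn.Module, a trainer can inspect or overwrite
# parameters in place, e.g. via model.named_parameters().
params = list(model.parameters())
print(f"got {len(params)} parameter tensors")
```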
Yes, this should cause deterministic scheduling. Separately, do you think we can switch from an ENV variable to an EngineArg?
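For reference, a minimal sketch of the current ENV-variable approach (the EngineArg alternative discussed above would replace the `os.environ` line; the model name is an illustrative choice):

```python
# Hedged sketch: keeping the v1 engine in the calling process by disabling
# multiprocessing. The variable must be set before the engine is created.
import os

os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

from vllm import LLM

# The engine core now runs in this process, so external frameworks can
# reach its internals, and scheduling is easier to make deterministic.
llm = LLM(model="facebook/opt-125m")
```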
@@ -567,6 +567,10 @@ def init_worker(self, all_kwargs: List[Dict[str, Any]]) -> None:
        self.worker = worker_class(**kwargs)
        assert self.worker is not None

    def initialize_from_config(self, kv_cache_configs: List[Any]) -> None:
        kv_cache_config = kv_cache_configs[self.rpc_rank]
@ruisearch42 @comaniac FYI: if a method needs to send different arguments to different ranks, the indexing should use `self.rpc_rank`, and it should happen in this `WorkerWrapperBase`.
I don't have a strong opinion here.
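A minimal sketch of the pattern described above (the class and the downstream call are illustrative, not vLLM's exact code):

```python
# Hedged sketch: when a collective RPC carries one argument per rank,
# the wrapper picks its own slice by self.rpc_rank, so callers can
# broadcast one identical RPC to every rank.
from typing import Any, List


class WorkerWrapperSketch:
    def __init__(self, rpc_rank: int):
        # rpc_rank: this wrapper's index among all RPC targets.
        self.rpc_rank = rpc_rank
        self.worker = None  # set later by init_worker

    def initialize_from_config(self, kv_cache_configs: List[Any]) -> None:
        # Each rank receives the full per-rank list and indexes into it.
        kv_cache_config = kv_cache_configs[self.rpc_rank]
        self.worker.initialize_from_config(kv_cache_config)
```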
@@ -151,7 +152,7 @@ def execute_model(
        scheduler_output: "SchedulerOutput",
    ) -> Optional[ModelRunnerOutput]:
        output = self.model_runner.execute_model(scheduler_output)
        return output if self.rank == 0 else None
@WoosukKwon we need a base class for the workers so that we can deduplicate this code lol
Right now I just changed both of them, but we should do the unification in the future.
Got it. Filed #13711 to track the issue.
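A minimal sketch of what such a unification could look like (names are illustrative assumptions, not vLLM's actual class hierarchy):

```python
# Hedged sketch: a common worker base class holding the rank-0 output
# logic that is currently duplicated across worker implementations.
from abc import ABC, abstractmethod
from typing import Any, Optional


class BaseWorkerSketch(ABC):
    def __init__(self, rank: int):
        self.rank = rank

    @abstractmethod
    def run_model(self, scheduler_output: Any) -> Any:
        """Backend-specific forward pass."""

    def execute_model(self, scheduler_output: Any) -> Optional[Any]:
        output = self.run_model(scheduler_output)
        # Shared policy: only the driver (rank 0) returns output, so the
        # executor does not gather duplicate results from other ranks.
        return output if self.rank == 0 else None
```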
LGTM! Thanks for the PR! Only left minor comments.
This is a continuation of #12071 in v1.

Some changes to notice:
- `VLLM_ENABLE_V1_MULTIPROCESSING` is disabled, so that the engine lives in the same process as the `LLM` class, which is required by the RLHF framework https://github.com/volcengine/verl. This also reduces scheduling non-determinism. (cc @robertgshaw2-redhat to confirm: in this case, can we guarantee that all calls of `llm.generate` will produce the same scheduling decision?)
- Execution under an external launcher such as torchrun goes through `ExecutorWithExternalLauncher`.
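To illustrate the intended usage, here is a minimal sketch in the spirit of vLLM's torchrun example (the model choice, prompts, and sampling settings are illustrative); every rank runs the same script, launched with `torchrun --nproc-per-node=2 script.py`:

```python
# Hedged sketch: SPMD-style offline inference under torchrun. All ranks
# construct the same LLM; the external_launcher backend reuses the
# torchrun-initialized distributed environment instead of spawning workers.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    distributed_executor_backend="external_launcher",
)

outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)

# Every rank receives the same outputs, so downstream (e.g. RLHF) code
# can run identically on all ranks.
for out in outputs:
    print(out.outputs[0].text)
```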