Support llama3 autoparallel + pipelining #1657
base: autoparallel
Conversation
assert parallel_dims.cp_enabled is False, "CP not supported yet"
assert parallel_dims.pp_enabled is False, "PP not supported yet"

pp_degree = job_config.parallelism.pipeline_parallel_degree
Unused pp_degree config; this should probably raise an error when it's not the local world size.
I deleted it (it was unused/unneeded). I don't think we need to raise any error: pp_degree does not need to equal any particular size, and PP can even be disabled.
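For context, the invariant that actually matters is the product of all the degrees, not pp_degree on its own. A minimal sketch of that check, with hypothetical names (`validate_degrees` is not an existing torchtitan API):

```python
# Hypothetical sketch, not torchtitan's actual validation code:
# pp_degree is only one factor of the world size, alongside the SPMD dims,
# so the useful invariant is that the product of all degrees equals world_size.
def validate_degrees(world_size: int, pp: int, dp_replicate: int, dp_shard: int, tp: int) -> None:
    product = pp * dp_replicate * dp_shard * tp
    if product != world_size:
        raise ValueError(
            f"pp({pp}) * dp_replicate({dp_replicate}) * dp_shard({dp_shard}) * tp({tp}) "
            f"= {product} != world_size({world_size})"
        )

# 8 ranks with pp=2, dp_shard=2, tp=2 is valid even though pp != world size
validate_degrees(world_size=8, pp=2, dp_replicate=1, dp_shard=2, tp=2)
```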
spmd_dims.append("tp")
spmd_mesh = world_mesh[spmd_dims]

dp_degree = 1
Same here; the config could specify dp_degree.
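A sketch of what that could look like, reusing the expression that already appears commented out further down (`dp_replicate * dp_shard`); the attribute names are assumptions about `parallel_dims`, not verified against this branch:

```python
# Sketch only: derive the data-parallel degree from the parallelism config
# instead of hardcoding dp_degree = 1. The replicate/shard factoring mirrors
# the commented-out code later in this diff.
def compute_dp_degree(dp_replicate: int, dp_shard: int) -> int:
    return dp_replicate * dp_shard

# e.g. 2-way replication over 4-way sharding gives an 8-way data-parallel degree
assert compute_dp_degree(dp_replicate=2, dp_shard=4) == 8
```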
inputs, target=targets, losses=losses, input_batch=inputs
# TODO: input_batch kwarg only needed for CP, but
# autoparallel doesn't accept kwargs in its forward
inputs, target=targets, losses=losses #, input_batch=inputs
Curious, why does CP need `input_batch`?
I assumed you would know. Am I wrong?
Oh, there was a change to remove the need for `input_batch`. We may want to make the same change to autoparallel.
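A toy illustration (not autoparallel's real code) of the constraint the TODO describes: a module whose forward only takes positional arguments cannot be fed an extra `input_batch` keyword, which is why the kwarg is commented out above:

```python
import torch
import torch.nn as nn

class PositionalOnlyForward(nn.Module):
    # Stand-in for a traced/parallelized module whose forward signature
    # was fixed to positional arguments only.
    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return tokens * 2

m = PositionalOnlyForward()
m(torch.ones(2, 4))  # fine
try:
    m(torch.ones(2, 4), input_batch=torch.ones(2, 4))
except TypeError as e:
    print(e)  # e.g. "got an unexpected keyword argument 'input_batch'"
```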
pp_degree = job_config.parallelism.pipeline_parallel_degree
local_batch_size = job_config.training.local_batch_size
spmd_batch_size = local_batch_size
Oops, this is a bug for the non-PP case: it should be local * dp_degree, and go in an 'else' branch.
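For clarity, a sketch of the fix being described (names taken from the diff above; whether the PP branch keeps the plain local batch size is an assumption):

```python
# Sketch of the described fix, not the actual committed code: in the non-PP
# case the SPMD batch covers all data-parallel ranks, so it scales by dp_degree.
def compute_spmd_batch_size(local_batch_size: int, dp_degree: int, pp_enabled: bool) -> int:
    if pp_enabled:
        return local_batch_size
    else:
        return local_batch_size * dp_degree

assert compute_spmd_batch_size(local_batch_size=8, dp_degree=4, pp_enabled=False) == 32
assert compute_spmd_batch_size(local_batch_size=8, dp_degree=4, pp_enabled=True) == 8
```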
fixed
self.pp_schedule.step(
    inputs, target=targets, losses=losses, input_batch=inputs
# TODO: input_batch kwarg only needed for CP, but
# autoparallel doesn't accept kwargs in its forward
Can we just fix this LOL
# # step.
# dp_degree = parallel_dims.dp_replicate * parallel_dims.dp_shard
# global_batch_size = job_config.training.local_batch_size * dp_degree
if parallel_dims.pp_enabled and pp_rank > 0:
What a mess. No action needed here, but it's definitely worth thinking about what the terminal UX state should be.
So far just tested locally:
`LOG_RANK=4 CONFIG_FILE=././torchtitan/models/deepseek_v3/train_configs/debug_model.toml ./run_train.sh --model.name llama3_auto_parallel --parallelism.pipeline_parallel_degree 2 --training.steps 100`
Runs and loss converges. Left one TODO about global batch size and gradient accumulation.
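For reference, the bookkeeping that TODO refers to, as a generic sketch (not torchtitan's actual config fields or implementation):

```python
# Generic sketch: the effective global batch size is the local batch size
# multiplied by the data-parallel degree and any gradient-accumulation steps.
def global_batch_size(local_batch_size: int, dp_degree: int, grad_accum_steps: int = 1) -> int:
    return local_batch_size * dp_degree * grad_accum_steps

# e.g. local batch 8, dp_degree 4, 2 accumulation steps -> 64 samples per optimizer step
assert global_batch_size(8, 4, grad_accum_steps=2) == 64
```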