
Conversation

H-Huang (Member) commented Sep 18, 2025

Attempt at Option 1 of #1682

Run with:

```shell
NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh
```

`torch.export(strict=False)` fails with:

```
File "/data/users/howardhuang/titan2/torchtitan/models/moe.py", line 148, in forward
    return _run_experts_grouped_mm(
File "/data/users/howardhuang/titan2/torchtitan/distributed/expert_parallel.py", line 261, in wrapper
    ) = generate_permute_indices(
File "/data/users/howardhuang/titan2/torchtitan/experiments/kernels/moe/indices.py", line 204, in generate_permute_indices
    permuted_indices = fill_indices_wrapper(
File "/data/users/howardhuang/titan2/torchtitan/experiments/kernels/moe/indices.py", line 90, in fill_indices_wrapper
    _fill_indices_kernel[grid](
File "<string>", line 4, in dynamic_func
File "/data/users/howardhuang/pytorch/torch/fx/experimental/proxy_tensor.py", line 1409, in __torch_function__
    return func(*args, **kwargs)
File "/data/users/howardhuang/pytorch/torch/fx/experimental/proxy_tensor.py", line 1479, in __torch_function__
    return func(*args, **kwargs)
File "/data/users/howardhuang/pytorch/torch/_export/non_strict_utils.py", line 1066, in __torch_function__
    return func(*args, **kwargs)
RuntimeError: Cannot access data pointer of Tensor (e.g. FakeTensor, FunctionalTensor). If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html

The above exception was the direct cause of the following exception:

File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_IR.py", line 1007, in _trace_with_export
    raise RuntimeError(
File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_IR.py", line 1039, in from_tracing
    exported_program = Pipe._trace_with_export(
File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_IR.py", line 1232, in pipeline
    return Pipe.from_tracing(
File "/data/users/howardhuang/titan2/torchtitan/models/llama3/infra/pipeline.py", line 191, in pipeline_llama_tracer
    pipe = pipeline(
File "/data/users/howardhuang/titan2/torchtitan/train.py", line 237, in __init__
    ) = self.train_spec.pipelining_fn(
File "/data/users/howardhuang/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
File "/data/users/howardhuang/titan2/torchtitan/train.py", line 648, in <module>
    trainer = Trainer(config)
RuntimeError: It seems that we cannot capture your model as a full graph. Typical reasons include graph breaks, data/shape-dependent control flow, or missing meta kernels for custom operators. You can use our manual pipeline interfaces, or try to fix the graph breaks, see https://pytorch.org/docs/stable/export.html
```

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 18, 2025