
Conversation

H-Huang (Member) commented Sep 18, 2025

Attempt at Option 1 of #1682

Run with:

```shell
NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh
```

`torch.export(strict=False)` fails with:

```
File "/data/users/howardhuang/titan2/torchtitan/models/moe.py", line 148, in forward
    return _run_experts_grouped_mm(
File "/data/users/howardhuang/titan2/torchtitan/distributed/expert_parallel.py", line 261, in wrapper
    ) = generate_permute_indices(
File "/data/users/howardhuang/titan2/torchtitan/experiments/kernels/moe/indices.py", line 204, in generate_permute_indices
    permuted_indices = fill_indices_wrapper(
File "/data/users/howardhuang/titan2/torchtitan/experiments/kernels/moe/indices.py", line 90, in fill_indices_wrapper
    _fill_indices_kernel[grid](
File "<string>", line 4, in dynamic_func
File "/data/users/howardhuang/pytorch/torch/fx/experimental/proxy_tensor.py", line 1409, in __torch_function__
    return func(*args, **kwargs)
File "/data/users/howardhuang/pytorch/torch/fx/experimental/proxy_tensor.py", line 1479, in __torch_function__
    return func(*args, **kwargs)
File "/data/users/howardhuang/pytorch/torch/_export/non_strict_utils.py", line 1066, in __torch_function__
    return func(*args, **kwargs)
RuntimeError: Cannot access data pointer of Tensor (e.g. FakeTensor, FunctionalTensor). If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html

The above exception was the direct cause of the following exception:

File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_IR.py", line 1007, in _trace_with_export
    raise RuntimeError(
File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_IR.py", line 1039, in from_tracing
    exported_program = Pipe._trace_with_export(
File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_IR.py", line 1232, in pipeline
    return Pipe.from_tracing(
File "/data/users/howardhuang/titan2/torchtitan/models/llama3/infra/pipeline.py", line 191, in pipeline_llama_tracer
    pipe = pipeline(
File "/data/users/howardhuang/titan2/torchtitan/train.py", line 237, in __init__
    ) = self.train_spec.pipelining_fn(
File "/data/users/howardhuang/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
File "/data/users/howardhuang/titan2/torchtitan/train.py", line 648, in <module>
    trainer = Trainer(config)
RuntimeError: It seems that we cannot capture your model as a full graph. Typical reasons include graph breaks, data/shape-dependent control flow, or missing meta kernels for custom operators. You can use our manual pipeline interfaces, or try to fix the graph breaks, see https://pytorch.org/docs/stable/export.html
```

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 18, 2025