
Conversation

@limjcst (Contributor) commented on Apr 8, 2025

I want to reuse a composed module in the pipeline. For example, the following `MyModule` has a member `linear`, which is itself a module.

```python
class MyModule(torch.nn.Module):
    def __init__(self, n_in: int, n_out: int):
        super().__init__()
        self.linear = torch.nn.Linear(n_in, n_out)
        self.layer_norm = torch.nn.LayerNorm(n_out)

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        hidden = self.linear(data)
        hidden = self.layer_norm(hidden)
        return hidden
```

`MyModule.linear.weight` should be synchronized among the related ranks, so I add `linear.weight` to `TiedLayerSpec.tied_weight_attr` (a hypothetical spec is sketched below).
As an aside, I generate the whole `tied_weight_attr` list with the following expression.

```python
tied_weight_attr = [name for name, p in layer.named_parameters() if p.numel() > 1]
```
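
For context, here is a hypothetical layer spec using such a nested attribute path. The tie key, the positional module arguments, and passing `tied_weight_attr` as a list are illustrative assumptions, not code from this PR:

```python
from deepspeed.pipe import TiedLayerSpec

# Hypothetical spec: pipeline stages sharing the "tied_linear" key would
# synchronize MyModule.linear.weight. Exact argument layout may differ.
spec = TiedLayerSpec("tied_linear", MyModule, n_in, n_out,
                     tied_weight_attr=["linear.weight"])
```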

However, the builtin `getattr` used by `PipelineModule` cannot resolve a nested attribute such as `linear.weight`.
Hence, this PR first extends the builtin `getattr` into a recursive version, `PipelineModule._recursive_getattr`, which walks the dotted path one attribute segment at a time (see the sketch below).

Meanwhile, the order in which tied weights are synchronized matters: if ranks iterate the tie keys in different orders, their collective calls mismatch and training hangs. This PR therefore sorts `tie_keys` in `PipelineModule._index_tied_modules` so every rank processes the tied groups in the same order.
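
For illustration, a minimal sketch of what such a recursive lookup could look like; this is an assumption about the helper's shape, not necessarily the exact code merged in this PR:

```python
def _recursive_getattr(module, attr_path):
    """Resolve a dotted attribute path such as "linear.weight" one segment at a time."""
    obj = module
    for segment in attr_path.split("."):
        obj = getattr(obj, segment)
    return obj
```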

Extend the builtin `getattr` to a recursive version `PipelineModule._recursive_getattr`
for nested tied weights, e.g., "linear.weight".
Meanwhile, sort tie_keys in `PipelineModule._index_tied_modules` to avoid hanging.

Signed-off-by: Mingjie Li <[email protected]>
@limjcst requested review from loadams and tohtana as code owners on April 8, 2025 at 06:33
@tohtana (Contributor) commented on Apr 8, 2025

Thank you @limjcst for the contribution! This is a significant improvement.

@loadams added this pull request to the merge queue on Apr 8, 2025
github-merge-queue bot removed this pull request from the merge queue due to failed status checks on Apr 8, 2025
@loadams added this pull request to the merge queue on Apr 9, 2025
github-merge-queue bot removed this pull request from the merge queue due to failed status checks on Apr 9, 2025
@limjcst (Contributor, Author) commented on Apr 9, 2025

nv-accelerate-v100 failed, raising "invalid command 'bdist_wheel'". However, this job succeeded in another run.

Note that the failed job used "cached wheel-0.46.1-py3-none-any.whl.metadata".
@agronholm suggested setuptools >= 70.1 for wheel==0.46.0. Is this suggestion meaningful for DeepSpeed?

pypa/wheel#660 (comment)

@loadams (Collaborator) commented on Apr 9, 2025

> nv-accelerate-v100 failed, raising "invalid command 'bdist_wheel'". However, this job succeeded in another run.
>
> Note that the failed job used "cached wheel-0.46.1-py3-none-any.whl.metadata". @agronholm suggested setuptools >= 70.1 for wheel==0.46.0. Is this suggestion meaningful for DeepSpeed?
>
> pypa/wheel#660 (comment)

Thanks @limjcst - I saw this failure on another PR and will take a look and merge the fixes into your PR when ready.

@loadams added this pull request to the merge queue on Apr 9, 2025
@loadams (Collaborator) commented on Apr 9, 2025

> nv-accelerate-v100 failed, raising "invalid command 'bdist_wheel'". However, this job succeeded in another run.
> Note that the failed job used "cached wheel-0.46.1-py3-none-any.whl.metadata". @agronholm suggested setuptools >= 70.1 for wheel==0.46.0. Is this suggestion meaningful for DeepSpeed?
> pypa/wheel#660 (comment)
>
> Thanks @limjcst - I saw this failure on another PR and will take a look and merge the fixes into your PR when ready.

@limjcst - it looks like the wheel team has yanked the problematic wheel version, so tests should be passing again.

@agronholm commented

Nevertheless, upgrading to setuptools >= 70.1 should prevent any future issues. I noticed that this project lacks a standard pyproject.toml which would be able to specify build requirements.

@loadams (Collaborator) commented on Apr 9, 2025

> Nevertheless, upgrading to setuptools >= 70.1 should prevent any future issues. I noticed that this project lacks a standard pyproject.toml which would be able to specify build requirements.

@agronholm - yes, we have a PR for one here which we will prioritize merging as we know this is needed.
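
For reference, a minimal `[build-system]` table along the lines @agronholm describes could look like the following; the exact pins are illustrative assumptions, not necessarily what DeepSpeed adopted:

```toml
[build-system]
# Illustrative pins only; setuptools >= 70.1 is the floor suggested above.
requires = ["setuptools>=70.1", "wheel"]
build-backend = "setuptools.build_meta"
```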

Merged via the queue into deepspeedai:master with commit 185330c on Apr 9, 2025. 10 checks passed.
ys950902 pushed a commit to ys950902/DeepSpeed that referenced this pull request May 21, 2025
deepcharm pushed a commit to deepcharm/DeepSpeed that referenced this pull request Jun 16, 2025