
Conversation

HollowMan6
Contributor

@HollowMan6 HollowMan6 commented Apr 28, 2025

Some params are one-dimensional; this PR adds support for them.

Resolve #7249

```log
param.shape torch.Size([768, 1536])
param.shape torch.Size([768])
...
```

```log
with deepspeed.module_inject.layers.GatherReplacedLayerParams([param], model, enabled=True):
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 359, in __enter__
self.params[0].gather_params(self.params)
File "torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 473, in gather_params
param.shape[1],
~~~~~~~~~~~^^^
IndexError: tuple index out of range
```
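The traceback shows `param.shape[1]` being read unconditionally, which fails for rank-1 tensors such as biases. A minimal pure-Python sketch of the kind of guard needed, where plain tuples stand in for `torch.Size` and the `gathered_shape` helper and its partition dims are illustrative, not the actual DeepSpeed implementation:

```python
# Sketch: compute the full (gathered) shape of a tensor-parallel
# partitioned parameter without assuming it is 2-D.
# Tuples stand in for torch.Size; names are illustrative only.

def gathered_shape(shape, world_size):
    """Return the gathered shape for a partitioned parameter.

    2-D weights are assumed partitioned along dim 1, so the gathered
    width is shape[1] * world_size; 1-D params (e.g. bias) only have
    dim 0, so shape[1] must never be touched for them.
    """
    if len(shape) == 1:                        # bias: torch.Size([768])
        return (shape[0] * world_size,)
    return (shape[0], shape[1] * world_size)   # weight: torch.Size([768, 1536])

print(gathered_shape((768, 1536), 2))  # (768, 3072)
print(gathered_shape((768,), 2))       # (1536,) -- no IndexError
```

The essential point is the `len(shape) == 1` branch: it mirrors the one-dimensional case this PR adds support for.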

@delock
Collaborator

delock commented Apr 28, 2025

Hi @Yejing-Lai can you also take a look at this PR?

@HollowMan6 HollowMan6 changed the title Fix QWen AutoTP when gathering replaced layer params Fix AutoTP gathering replaced layer params when bias is not None Apr 29, 2025
@HollowMan6 HollowMan6 requested a review from inkcherry April 29, 2025 10:06
@inkcherry
Contributor

LGTM thanks!

@HollowMan6
Contributor Author

Fixed the formatting issue.

@HollowMan6
Contributor Author

CI error seems to be caused by the environment instead of this PR:

```log
ImportError: /opt/conda/envs/ptca/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /scratch/azureml/cr/j/bd50c7f98a144dcd900fbdcd8943d8c5/exe/wd/actions-runner/_work/DeepSpeed/DeepSpeed/tests/./torch-extensions/async_io/async_io.so)
```

@loadams
Collaborator

loadams commented May 9, 2025

> CI error seems to be caused by the environment instead of this PR:
>
> ImportError: /opt/conda/envs/ptca/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /scratch/azureml/cr/j/bd50c7f98a144dcd900fbdcd8943d8c5/exe/wd/actions-runner/_work/DeepSpeed/DeepSpeed/tests/./torch-extensions/async_io/async_io.so)

Yes @HollowMan6 - thanks for following up on this PR. This is a known CI issue that I am working on and hope to have resolved ASAP.

@loadams loadams enabled auto-merge May 19, 2025 05:33
@loadams
Collaborator

loadams commented May 19, 2025


@HollowMan6 - the CI is working now and we see this error:

```log
E           RuntimeError: Output 0 of RowParallelBackward is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.
```
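The error message itself points at the remedy: a custom `torch.autograd.Function` must not return its input (or a view of it) as-is when callers may later mutate the result in place. A hedged sketch under that reading, where the `RowParallel` class below is only a stand-in for the real DeepSpeed code, not its implementation:

```python
import torch

class RowParallel(torch.autograd.Function):
    """Illustrative custom Function. If forward() returned `x` (or a
    view of it) as-is, a later in-place op on the output would raise
    the RuntimeError quoted above."""

    @staticmethod
    def forward(ctx, x):
        # Cloning breaks the view/identity relationship with the input,
        # which is exactly the fix the error message suggests.
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

x = torch.ones(4, requires_grad=True)
y = RowParallel.apply(x)
y.add_(1.0)          # safe: y is a clone, not a view of x
y.sum().backward()
print(x.grad)        # tensor([1., 1., 1., 1.])
```

With `return x` instead of `return x.clone()`, the `y.add_(1.0)` line triggers the forbidden view+inplace case.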

auto-merge was automatically disabled May 19, 2025 15:40

Head branch was pushed to by a user without write access

@HollowMan6
Contributor Author

HollowMan6 commented May 19, 2025

I made a fix following the error message's suggestion; hopefully this makes things work.

@inkcherry
Contributor

inkcherry commented May 22, 2025

Hi @HollowMan6, you might consider printing the outputs. If they appear equal, try slightly relaxing the precision of the allclose check. I've run into cases where allclose passed on my device but failed in CI environments, which might indicate the values are on the edge of the threshold.
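For readers unfamiliar with the tolerance knobs: `torch.allclose(a, b, rtol=..., atol=...)` passes when each element pair satisfies a combined relative/absolute bound, so values sitting near the threshold can flip between environments. A stdlib sketch using `math.isclose` as a stand-in for the per-element check (the sample values are made up for illustration):

```python
import math

# math.isclose plays the role of torch.allclose's per-element test here;
# both combine a relative and an absolute tolerance.
a, b = 1.00001, 1.000025   # relative difference of roughly 1.5e-5

# Tight tolerance: values this far apart fail the check.
print(math.isclose(a, b, rel_tol=1e-6))   # False

# Slightly relaxed tolerance: the same values pass -- the
# "edge of the threshold" situation described above.
print(math.isclose(a, b, rel_tol=1e-4))   # True
```

Printing the actual tensors, as suggested, shows whether a CI failure is a real numerical divergence or just a threshold edge case.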

@HollowMan6
Contributor Author

Thanks, I will check!

HollowMan6 and others added 2 commits May 22, 2025 22:24
Some params are one-dimensional; this PR adds support
for them.

```log
with deepspeed.module_inject.layers.GatherReplacedLayerParams([param], model, enabled=True):
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 359, in __enter__
self.params[0].gather_params(self.params)
File "torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 473, in gather_params
param.shape[1],
~~~~~~~~~~~^^^
IndexError: tuple index out of range
```

Signed-off-by: Hollow Man <[email protected]>
@inkcherry
Contributor

The CI failure should be a separate issue; I will send a patch to this branch to fix the CI.

@inkcherry
Contributor

https://github.com/HollowMan6/DeepSpeed/pull/1 @HollowMan6, you can merge this; I verified that the CI has passed. Hope this is helpful for you. Thanks for your effort on this issue.

HollowMan6 and others added 2 commits May 24, 2025 14:03
@hwchen2017 hwchen2017 added this pull request to the merge queue May 25, 2025
Merged via the queue into deepspeedai:master with commit b666844 May 25, 2025
10 checks passed
deepcharm pushed a commit to deepcharm/DeepSpeed that referenced this pull request Jun 16, 2025
…pspeedai#7257)

Some params are one-dimensional, this PR adds support for these params.

Resolve deepspeedai#7249

```log
param.shape torch.Size([768, 1536])
param.shape torch.Size([768])
...
```

```log
with deepspeed.module_inject.layers.GatherReplacedLayerParams([param], model, enabled=True):
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 359, in __enter__
self.params[0].gather_params(self.params)
File "torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 473, in gather_params
param.shape[1],
~~~~~~~~~~~^^^
IndexError: tuple index out of range
```

---------

Signed-off-by: Hollow Man <[email protected]>
Signed-off-by: inkcherry <[email protected]>
Co-authored-by: Hongwei Chen <[email protected]>
Co-authored-by: inkcherry <[email protected]>
Signed-off-by: Max Kovalenko <[email protected]>
Development

Successfully merging this pull request may close these issues.

[BUG]zero2 + autotp: IndexError: tuple index out of range
5 participants