Fix: Update grad norm calculation for CPU offload #7302
Conversation
Force-pushed from 92d30de to 02a646a
@therealnaveenkamal thanks for the PR. Unfortunately, the current approach won't work. It seems the out-of-sync data structure is `averaged_gradients`. What do you think?
Force-pushed from adbcecf to 91fa537
@sfc-gh-truwase I agree. I've reverted my changes and added a norm update in the local API. Please let me know your thoughts.
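For readers following along, here is a minimal sketch of the per-parameter clipping flow this thread refers to. It is illustrative only, not the PR's actual change: `safe_set_local_grad` is the API named in this PR, `safe_get_local_grad` is assumed here as its read-side counterpart, and the helper name `clip_local_grads` is hypothetical.

```python
# Illustrative sketch only, not the PR's change itself: clip each ZeRO-3
# local (partitioned) gradient shard in place via DeepSpeed's helpers.
from deepspeed.utils import safe_get_local_grad, safe_set_local_grad


def clip_local_grads(model, clip_value: float):
    for param in model.parameters():
        grad = safe_get_local_grad(param)  # local gradient shard for this rank
        if grad is None:
            continue
        # Clamp the shard element-wise; with this PR, writing it back should
        # also keep the engine's tracked gradient norm in sync under CPU offload.
        safe_set_local_grad(param, grad.clamp(-clip_value, clip_value))
```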
@therealnaveenkamal, looks good to me. Did you get a chance to test? Also, is it possible to convert your test case into a unit test?
Merge branch 'master' of https://github.com/deepspeedai/DeepSpeed
Signed-off-by: Naveenraj Kamalakannan <[email protected]>
Force-pushed from 91fa537 to 47b229f
@sfc-gh-truwase I've added a unit-test file, test_zero_grad_clip.py. It runs four tests with BF16 and FP16.
post_clip_norm = clamped_grad.norm().item()

if pre_clip_norm > clip_value:
    print(f"DEBUG: Param {param.ds_id} - Pre-clip norm: {pre_clip_norm:.6f}, Post-clip norm: {post_clip_norm:.6f}")
Please remove the print() so the unit test is not noisy.
@sfc-gh-truwase Thanks for letting me know. I've updated the file.
@sfc-gh-truwase Looks like the test is failing with a module-not-found error for mpi4py. Can you please help me here?
I will take a look on the next run. Usually, mpi4py should have been handled by the requirements.txt. Also, did you run the formatting checks? https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites
@sfc-gh-truwase I updated my script and it now works without mpi4py. I also ran the formatting checks. I did sign off, but the DCO check still shows as failed. Sorry for the trouble caused, I'm a first-time contributor!
@therealnaveenkamal, no apologies needed. We love first time contributors :)
@sfc-gh-truwase Looks like there were issues with the torch.distributed setup. When I tested locally, the tests passed. I've handled the exceptions and pushed my updates; hopefully the workflow passes this time.
configfile: pytest.ini
plugins: forked-1.6.0
collected 4 items
tests/unit/runtime/zero/test_zero_grad_clip.py::TestZeroGradClip::test_grad_clip_and_norm_update[fp16-0.5-cpu] PASSED [ 25%]
tests/unit/runtime/zero/test_zero_grad_clip.py::TestZeroGradClip::test_grad_clip_and_norm_update[bf16-0.05-cpu] PASSED [ 50%]
tests/unit/runtime/zero/test_zero_grad_clip.py::TestZeroGradClip::test_grad_clip_and_norm_update[fp16-0.5-none] PASSED [ 75%]
tests/unit/runtime/zero/test_zero_grad_clip.py::TestZeroGradClip::test_grad_clip_and_norm_update[bf16-0.05-none] PASSED [100%]
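The four test IDs above correspond to two dtype/clip pairs (fp16 with 0.5, bf16 with 0.05) crossed with optimizer offload set to cpu or none. As a rough, hypothetical illustration of how that parametrization might be laid out (this is not the actual contents of test_zero_grad_clip.py; only the class and method names are taken from the IDs above, and the config keys shown are standard DeepSpeed ZeRO options chosen for the sketch):

```python
# Hypothetical sketch of the parametrization implied by the test IDs above;
# not the actual contents of tests/unit/runtime/zero/test_zero_grad_clip.py.
import pytest


class TestZeroGradClip:

    @pytest.mark.parametrize("offload_device", ["cpu", "none"])
    @pytest.mark.parametrize("dtype, clip_value", [("fp16", 0.5), ("bf16", 0.05)])
    def test_grad_clip_and_norm_update(self, dtype, clip_value, offload_device):
        ds_config = {
            "train_micro_batch_size_per_gpu": 1,
            "zero_optimization": {"stage": 3},
            dtype: {"enabled": True},  # enables the "fp16" or "bf16" section
        }
        if offload_device == "cpu":
            ds_config["zero_optimization"]["offload_optimizer"] = {"device": "cpu"}
        # ... initialize a small linear model with this config, run a step,
        # clip the local gradients, and assert that the reported global
        # gradient norm reflects the clipped values.
```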
Sorry about the flaky CI. I will also keep an eye on it. Thanks!
@sfc-gh-truwase Thanks for the support. Would love to contribute more.
## Description
This PR fixes an issue where gradient clipping modifications are not reflected in the global gradient norm calculation when CPU offloading is enabled. The issue occurs because the `averaged_gradients` are not being updated with the clipped gradients when CPU offloading is active.

## Problem
When using CPU offloading with gradient clipping:
1. The gradients are successfully clipped using `safe_set_local_grad`.
2. However, the `_global_grad_norm` calculation still uses the original, unclipped gradients.
3. This leads to incorrect gradient norm reporting and potential issues with gradient clipping effectiveness.

## Solution
The fix ensures that the `averaged_gradients` are properly updated with the clipped gradients when CPU offloading is enabled, similar to how it works when CPU offloading is disabled.

## Testing
The fix has been tested with:
- CPU offloading enabled and disabled
- Different gradient clipping values
- A simple model with linear layers
- Both FP16 and BF16

## Related Issues
Fixes #7292

Signed-off-by: Naveenraj Kamalakannan <[email protected]>
Signed-off-by: Max Kovalenko <[email protected]>
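For anyone wanting to reproduce the behaviour described above, here is a minimal, hypothetical sketch along the lines of the testing setup (a small linear model, optimizer offload toggled via the config, gradients clipped through `safe_set_local_grad`). The helper name `run_step`, the config values, the `safe_get_local_grad` read, and the use of `get_global_grad_norm()` as the check are assumptions for illustration, not the PR's actual test code.

```python
# Hypothetical reproduction sketch; not the PR's test code.
import torch
import deepspeed
from deepspeed.utils import safe_get_local_grad, safe_set_local_grad


def run_step(offload: bool, clip_value: float = 0.5) -> float:
    model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 1))
    config = {
        "train_micro_batch_size_per_gpu": 1,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 3},
    }
    if offload:
        config["zero_optimization"]["offload_optimizer"] = {"device": "cpu"}

    engine, _, _, _ = deepspeed.initialize(model=model,
                                           model_parameters=model.parameters(),
                                           config=config)

    x = torch.randn(1, 8, device=engine.device, dtype=torch.half)
    engine.backward(engine(x).sum())

    # Clip each local gradient shard before the optimizer step.
    for p in engine.module.parameters():
        g = safe_get_local_grad(p)
        if g is not None:
            safe_set_local_grad(p, g.clamp(-clip_value, clip_value))
    engine.step()

    # With this fix, the reported norm should reflect the clipped gradients
    # regardless of whether CPU offload is enabled.
    return engine.get_global_grad_norm()
```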