Conversation

therealnaveenkamal
Contributor

Description

This PR fixes an issue where gradient clipping modifications are not reflected in the global gradient norm calculation when CPU offloading is enabled. The issue occurs because the averaged_gradients are not being updated with the clipped gradients when CPU offloading is active.

Problem

When using CPU offloading with gradient clipping:

  1. The gradients are successfully clipped using `safe_set_local_grad`.
  2. However, the `_global_grad_norm` calculation still uses the original, unclipped gradients.
  3. This leads to incorrect gradient-norm reporting and undermines the effectiveness of gradient clipping.
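To make the failure mode concrete, here is a toy sketch of the mechanism. This is not DeepSpeed's actual code; the class and method names are illustrative. The point is that a per-parameter norm cached at backward time goes stale if a gradient is overwritten afterwards without refreshing the cache:

```python
import math

# Toy model (illustrative names, not DeepSpeed's implementation) of why the
# cached norm goes stale: per-parameter squared norms are recorded during the
# backward pass, so overwriting a gradient later without refreshing the cache
# leaves the global norm computed from pre-clip values.
class ToyOffloadOptimizer:
    def __init__(self):
        self.grads = {}                 # param_id -> gradient values
        self.norm_for_param_grads = {}  # param_id -> squared norm cached at backward time

    def backward_step(self, param_id, grad):
        self.grads[param_id] = grad
        self.norm_for_param_grads[param_id] = sum(g * g for g in grad)

    def set_local_grad(self, param_id, grad, refresh_norm=False):
        self.grads[param_id] = grad
        if refresh_norm:  # the fix: mirror what the backward pass does
            self.norm_for_param_grads[param_id] = sum(g * g for g in grad)

    def global_grad_norm(self):
        return math.sqrt(sum(self.norm_for_param_grads.values()))

opt = ToyOffloadOptimizer()
opt.backward_step(0, [3.0, 4.0])                      # norm 5.0 cached
opt.set_local_grad(0, [0.3, 0.4])                     # "clipped" grad, true norm 0.5
stale = opt.global_grad_norm()                        # still 5.0 -- the bug
opt.set_local_grad(0, [0.3, 0.4], refresh_norm=True)
fixed = opt.global_grad_norm()                        # now ~0.5 -- the fix
```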

Solution

The fix ensures that the averaged_gradients are properly updated with the clipped gradients when CPU offloading is enabled, similar to how it works when CPU offloading is disabled.

Testing

The fix has been tested with:

  • CPU offloading enabled and disabled
  • Different gradient clipping values
  • A simple model with linear layers
  • Both FP16 and BF16

Related Issues

Fixes #7292

@sfc-gh-truwase
Collaborator

sfc-gh-truwase commented May 22, 2025

@therealnaveenkamal thanks for the PR. Unfortunately, the current approach won't work because

  1. perf issue for backward
  2. averaged_gradients (despite the poor naming) is meant for the GPU-only execution.

It seems the out-of-sync data structure is self.norm_for_param_grads. In offload case, norms are computed on-the-fly and maintained in self.norm_for_param_grads. I wonder if safe_set_* APIs can be handled by also calling self.set_norm_for_param_grad_in_gpu(param) similar to during backward pass.

What do you think?

@therealnaveenkamal
Contributor Author

therealnaveenkamal commented May 22, 2025

@sfc-gh-truwase I agree. I've reverted my changes and added a norm update in the local API. Please let me know your thoughts

```python
if self.offload_optimizer:
    self.norm_for_param_grads[self.get_param_id(param)] = self._constant_buffered_norm2(value)
```
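For illustration, a minimal stub (not the real DeepSpeed optimizer class; `_constant_buffered_norm2` is approximated here as a plain squared L2 norm) showing where this offload-aware guard sits inside the local-grad setter:

```python
# Minimal stub (illustrative, not DeepSpeed's actual class) showing the
# offload-aware norm refresh inside the local-grad setter.
class StubOptimizer:
    def __init__(self, offload_optimizer):
        self.offload_optimizer = offload_optimizer
        self.grads = {}
        self.norm_for_param_grads = {}

    def get_param_id(self, param):
        return id(param)

    def _constant_buffered_norm2(self, value):
        # Stand-in for DeepSpeed's buffered squared-norm computation.
        return sum(v * v for v in value)

    def set_local_grad(self, param, value):
        self.grads[self.get_param_id(param)] = value
        if self.offload_optimizer:
            self.norm_for_param_grads[self.get_param_id(param)] = \
                self._constant_buffered_norm2(value)

param = object()
opt = StubOptimizer(offload_optimizer=True)
opt.set_local_grad(param, [0.3, 0.4])
refreshed = opt.norm_for_param_grads[opt.get_param_id(param)]  # ~0.25
```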

@sfc-gh-truwase
Collaborator

@therealnaveenkamal, looks good to me. Did you get a chance to test? Also, is it possible to convert your test case into a unit test?

@therealnaveenkamal
Contributor Author

@sfc-gh-truwase I've added a unit-test file, `test_zero_grad_clip.py`. It runs four tests covering BF16 and FP16.

```python
post_clip_norm = clamped_grad.norm().item()

if pre_clip_norm > clip_value:
    print(f"DEBUG: Param {param.ds_id} - Pre-clip norm: {pre_clip_norm:.6f}, Post-clip norm: {post_clip_norm:.6f}")
```
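The behavior the test exercises is clipping by global norm. A minimal, self-contained sketch of that technique (illustrative helper name, not DeepSpeed's API) for reference:

```python
import math

# Clip-by-global-norm sketch: if the total L2 norm of all gradients exceeds
# clip_value, scale every gradient by clip_value / (total_norm + eps).
def clip_grads_by_global_norm(grads, clip_value, eps=1e-6):
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm > clip_value:
        scale = clip_value / (total_norm + eps)
        grads = [[g * scale for g in grad] for grad in grads]
    return grads, total_norm

grads, pre = clip_grads_by_global_norm([[3.0, 4.0]], 0.5)
post = math.sqrt(sum(g * g for grad in grads for g in grad))
# pre is 5.0; post is ~0.5 after clipping
```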
Collaborator


Please remove the print() so the unit test is not noisy

Contributor Author


@sfc-gh-truwase Thanks for letting me know. I've updated the file.

Contributor Author


@sfc-gh-truwase Looks like the test is failing with a module-not-found error for mpi4py. Can you please help me here?

Collaborator


I will take a look on the next run. Usually, mpi4py should have been handled by the requirements.txt.

Also, did you run the formatting checks?
https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

Contributor Author


@sfc-gh-truwase I updated my script; it works without mpi4py. I also ran the formatting checks. I did sign off, but the DCO check still shows as failed.

and sorry for the trouble caused, I'm a first time contributor!

Collaborator


@therealnaveenkamal, no apologies needed. We love first time contributors :)

Contributor Author


@sfc-gh-truwase Looks like there were issues with the torch.distributed setup. When I test locally, the tests pass. I've handled the exceptions and pushed my updates; hopefully the workflow passes this time.

```
configfile: pytest.ini
plugins: forked-1.6.0
collected 4 items

tests/unit/runtime/zero/test_zero_grad_clip.py::TestZeroGradClip::test_grad_clip_and_norm_update[fp16-0.5-cpu] PASSED   [ 25%]
tests/unit/runtime/zero/test_zero_grad_clip.py::TestZeroGradClip::test_grad_clip_and_norm_update[bf16-0.05-cpu] PASSED  [ 50%]
tests/unit/runtime/zero/test_zero_grad_clip.py::TestZeroGradClip::test_grad_clip_and_norm_update[fp16-0.5-none] PASSED  [ 75%]
tests/unit/runtime/zero/test_zero_grad_clip.py::TestZeroGradClip::test_grad_clip_and_norm_update[bf16-0.05-none] PASSED [100%]
```

Collaborator


Sorry about the flaky CI. I will also keep an eye on it. Thanks!

Contributor Author


@sfc-gh-truwase Thanks for the support. Would love to contribute more.

@sfc-gh-truwase sfc-gh-truwase added this pull request to the merge queue May 27, 2025
Merged via the queue into deepspeedai:master with commit b9af5d8 May 27, 2025
13 checks passed
deepcharm pushed a commit to deepcharm/DeepSpeed that referenced this pull request Jun 16, 2025
Antlera pushed a commit to Antlera/DeepSpeed that referenced this pull request Jun 27, 2025
Successfully merging this pull request may close these issues.

Question about safe_set_local_grad and safe_get_local_grad