
Conversation

@k223kim (Contributor) commented Jun 18, 2025

Before submitting
  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

This PR fixes #1670; it adds a test showing that Thunder's current no_sync context manager already behaves like PyTorch's. If accepted, I will open a follow-up PR that removes sync_grads entirely and updates the docs.
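
For reference, a minimal sketch of the PyTorch DDP semantics being matched (reference behavior, not code from this PR): inside DistributedDataParallel.no_sync(), backward passes accumulate gradients locally without the all-reduce, and the first backward outside the context synchronizes the accumulated gradients across ranks.

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def accumulate_then_sync(model: DDP, batches, criterion: nn.Module):
    # Accumulate gradients locally for all but the last micro-batch.
    with model.no_sync():
        for inputs, targets in batches[:-1]:
            criterion(model(inputs), targets).backward()  # no gradient all-reduce here
    # The first backward outside no_sync() performs the all-reduce,
    # synchronizing the accumulated gradients across ranks.
    inputs, targets = batches[-1]
    criterion(model(inputs), targets).backward()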

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@k223kim k223kim changed the title feat: match no_sync with pytorch [wip] feat: match no_sync with pytorch Jun 18, 2025
@k223kim k223kim marked this pull request as ready for review June 20, 2025 15:41
@k223kim k223kim changed the title [wip] feat: match no_sync with pytorch fix: match no_sync with pytorch Jun 20, 2025
device = torch.device("cuda", test_case.rank)

gradients = defaultdict(list)
for use_no_sync in (True, False):
Member

Can we rather have it as an argument, or do we really need this loop?

Contributor Author

Hmm, we do not need this loop. Would it make a difference if we use arguments?

Collaborator

From the perspective of what we test, there's no difference, but making use_no_sync an argument would make the test logs much better: they would clearly tell us which setting fails. I'd have use_no_sync as an argument, as @Borda suggests.
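
A hedged sketch of the suggestion above; the decorator shown is pytest's parametrize, while this repo's distributed test harness may use torch.testing's own parametrization helpers instead, and the test body is elided:

import pytest

@pytest.mark.parametrize("use_no_sync", [True, False])
def test_no_sync_matches_pytorch(use_no_sync):
    # Build the DDP model and run the training step with or without no_sync().
    ...
    # A failing run is now reported as e.g. test_no_sync_matches_pytorch[True],
    # which names the configuration that diverged.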

@t-vi (Collaborator) commented Jun 21, 2025

I think we should split the test.
That said, I would expect the no-sync test to roughly do the following:

for j in range(2):
    for i in range(2):
        with no_sync():
            fw
            loss
            bw
        grab grad.clones for comparison  # these would not be synced, even outside the context manager.
    fw
    loss
    bw  # sync happens only(!) here and below
    grab grad.clones for comparison  # these are synced
    zero_grad
    fw
    loss
    bw  # sync happens only(!) here and above
    grab grad.clones for comparison  # these are synced
    zero_grad

and then the gradients are compared, in order, between plain PyTorch DDP and Thunder DDP.

Collaborator

Note that the grabbing of the grads happens outside the no_sync.
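
A hedged, runnable-shaped sketch of the schedule described above; the helpers make_batch and loss_fn are hypothetical placeholders, and gradient clones are grabbed outside no_sync(), per the note above:

def run_no_sync_schedule(model, make_batch, loss_fn, num_outer=2, num_accum=2):
    # Return gradient snapshots in a fixed order so that a plain PyTorch DDP run
    # and a Thunder DDP run can be compared element-wise afterwards.
    snapshots = []

    def grab_grads():
        # Cloned outside no_sync(); gradients that were not synced stay unsynced here.
        return [p.grad.detach().clone() for p in model.parameters()]

    for _ in range(num_outer):
        for _ in range(num_accum):
            with model.no_sync():
                loss_fn(model(make_batch())).backward()  # no all-reduce inside the context
            snapshots.append(grab_grads())               # unsynced gradients
        loss_fn(model(make_batch())).backward()          # sync happens only here and below
        snapshots.append(grab_grads())                   # synced gradients
        model.zero_grad()
        loss_fn(model(make_batch())).backward()          # synced again, from zeroed gradients
        snapshots.append(grab_grads())                   # synced gradients
        model.zero_grad()
    return snapshots

The two snapshot lists (one from a torch DDP model, one from a Thunder DDP model) could then be compared pairwise with torch.testing.assert_close.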

Co-authored-by: Jirka Borovec <[email protected]>
test_case.assertGreater(len(no_sync_bwd_trc.bound_symbols), 1)
assert torch.allclose(torch_loss, loss, atol=1e-4, rtol=1e-4)

torch.testing.assert_close(torch_grad, thunder_grad, atol=1e-3, rtol=1e-3)
Contributor Author

@t-vi I think this should fail if thunder and torch are different
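
For reference (not part of the PR diff): torch.testing.assert_close raises an AssertionError with a mismatch report when the tensors differ beyond the given tolerances, whereas torch.allclose only returns a bool and needs a bare assert around it, so its failure message carries less detail.

import torch

a = torch.tensor([1.0, 2.0])
b = torch.tensor([1.0, 2.5])

torch.allclose(a, b, atol=1e-3, rtol=1e-3)       # returns False; no error on its own
try:
    torch.testing.assert_close(a, b, atol=1e-3, rtol=1e-3)
except AssertionError as err:
    print(err)  # reports how many elements mismatched and the largest difference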

@k223kim k223kim marked this pull request as draft June 23, 2025 09:25
Development

Successfully merging this pull request may close these issues.

Make ThunderModule.no_sync behave as PyTorch's Distributed one