
Conversation

@KAVYANSHTYAGI (Contributor) commented May 13, 2025

What does this PR do?

This PR resolves a critical issue in PyTorch Lightning's Distributed Data Parallel (DDP) training where a SIGTERM signal received by one rank (typically rank 0) can cause deadlocks due to unsynchronized termination behavior across ranks.

Problem Description

In DDP training, PyTorch requires all processes to synchronize at every training step. If one rank exits early (e.g., after receiving a SIGTERM signal) while the others continue to the next training step, the remaining ranks hang indefinitely, waiting at collectives such as all_gather or reduce.

This issue becomes highly problematic in production environments, especially in distributed setups orchestrated by Kubernetes or SLURM, where preemptions or resource eviction events are frequent.

Specifically, the flow causing the bug:

  1. Rank 0 receives SIGTERM and raises SIGTERMException after on_advance_end.
  2. Other ranks do not receive this signal and proceed into the next batch.
  3. These ranks block at the beginning of the next batch, waiting for rank 0 — which has already exited.
  4. This results in a distributed deadlock.

Fix

This PR introduces a synchronized SIGTERM handling mechanism using torch.distributed.broadcast():

  • Upon receiving a SIGTERM, rank 0 sets a sigterm_tensor = 1 and broadcasts it to all other ranks.
  • At the start of each training batch, every rank checks this broadcasted signal.
  • If the signal indicates a SIGTERM was received:
    • All ranks call torch.distributed.barrier() to synchronize safely.
    • Then, they raise SIGTERMException in unison, exiting the training loop cleanly.

This ensures no rank continues into the next batch while others have exited, thereby preventing deadlocks and enabling graceful termination and checkpointing.
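
For readers skimming the diff, a minimal, self-contained sketch of this pattern is shown below. It assumes a CPU/gloo process group, and the handler and helper names (`_handle_sigterm`, `check_sigterm_across_ranks`) are illustrative only, not the exact code in signal_connector.py or training_epoch_loop.py:

```python
import signal

import torch
import torch.distributed as dist


class SIGTERMException(SystemExit):
    """Raised on every rank once any rank has observed a SIGTERM."""


_sigterm_received = False


def _handle_sigterm(signum, frame):
    # Only record the signal; the actual shutdown is deferred to the
    # next batch boundary so all ranks can act on it together.
    global _sigterm_received
    _sigterm_received = True


signal.signal(signal.SIGTERM, _handle_sigterm)


def check_sigterm_across_ranks() -> None:
    """Call on every rank at the start of each training batch."""
    # With the NCCL backend the flag tensor must live on the local GPU;
    # for gloo a CPU tensor is fine.
    sigterm_tensor = torch.tensor([1 if _sigterm_received else 0], dtype=torch.int)
    # Rank 0's value is authoritative: every rank learns whether rank 0 saw SIGTERM.
    dist.broadcast(sigterm_tensor, src=0)
    if sigterm_tensor.item() == 1:
        # Synchronize first so no rank is left blocking inside a collective,
        # then raise at the same point of the loop on every rank.
        dist.barrier()
        raise SIGTERMException()
```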

Test Included

  • test_ddp_sigterm_handling.py:
    • Simulates a multi-process DDP training setup.
    • Injects SIGTERM in rank 0 during batch 2.
    • Asserts that:
      • All ranks receive the SIGTERM via broadcast.
      • All ranks raise SIGTERMException gracefully.
      • No deadlock occurs.

This test is guarded with @pytest.mark.skipif to skip in single-device or non-distributed setups, ensuring safety across CI platforms.
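
For orientation, the overall shape of such a test could look roughly like the sketch below. This is not the actual test_ddp_sigterm_handling.py from the PR; the world size, port, and helper names are placeholder choices, and the gloo backend is used so it can run on CPU:

```python
import os
import signal

import pytest
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 2


def _worker(rank: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29511")
    dist.init_process_group("gloo", rank=rank, world_size=WORLD_SIZE)

    state = {"sigterm": False}
    signal.signal(signal.SIGTERM, lambda *_: state.update(sigterm=True))

    try:
        for batch_idx in range(5):
            if rank == 0 and batch_idx == 2:
                # Inject SIGTERM on rank 0 only, mid-training.
                os.kill(os.getpid(), signal.SIGTERM)
            # Broadcast rank 0's flag so every rank sees the same decision.
            flag = torch.tensor([int(state["sigterm"])])
            dist.broadcast(flag, src=0)
            if flag.item() == 1:
                dist.barrier()
                return  # all ranks leave the loop together: no deadlock
            # ... one training step would go here ...
    finally:
        dist.destroy_process_group()


@pytest.mark.skipif(not dist.is_available(), reason="requires torch.distributed")
def test_ddp_sigterm_exits_all_ranks_without_deadlock():
    # If any worker hangs or crashes, spawn(join=True) fails the test.
    mp.spawn(_worker, nprocs=WORLD_SIZE, join=True)
```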


Fixes #20806

Before submitting

  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (not necessary here)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request? (None)
  • Did you update the CHANGELOG? (Optional for internal changes)

Reviewer checklist

  • Is this pull request ready for review?
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Fun fact: This fix is critical for scaling PL in real-world clusters with robust SIGTERM checkpointing and graceful exits. Hope this PR saves someone’s multi-day training job from being lost.


📚 Documentation preview 📚: https://pytorch-lightning--20825.org.readthedocs.build/en/20825/

@github-actions bot added the pl (Generic label for PyTorch Lightning package) label May 13, 2025
@Borda (Member) left a comment

Overall looks good, just some tests on CPU seem not to be happy about it.

@KAVYANSHTYAGI (Contributor, Author) left a comment

This PR now adds synchronized SIGTERM handling for DDP training in PyTorch Lightning, which safely broadcasts termination across all ranks to prevent deadlocks and exits cleanly via SIGTERMException. All distributed logic is CI-safe, and DDP tests are conditionally skipped on unsupported environments. The patch is robust, minimal, and production-friendly.

@lantiga (Collaborator) left a comment

Great fix @KAVYANSHTYAGI

@KAVYANSHTYAGI (Contributor, Author) left a comment

This update refines SIGTERM handling in DDP training by ensuring that SIGTERMException is only raised after a successful broadcast and is no longer suppressed unintentionally. The fix moves the exception outside the try block, addressing concerns about it being silently ignored.
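
In other words, the raise now happens after the guarded block rather than inside it. A small sketch of the pattern (the function and class names here are illustrative, not the PR's exact code):

```python
class SIGTERMException(SystemExit):
    pass


def maybe_raise_sigterm(sigterm_seen: bool) -> None:
    # Before: `raise SIGTERMException()` sat inside the try block, so the
    # surrounding exception handling could swallow it silently.
    try:
        pass  # ... broadcast the SIGTERM flag to the other ranks here ...
    except RuntimeError:
        pass  # ignore a failed broadcast, e.g. if the process group is gone
    # After: the raise lives outside the try block and can no longer be suppressed.
    if sigterm_seen:
        raise SIGTERMException()
```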

@deependujha (Collaborator) left a comment

Hi @KAVYANSHTYAGI, can you please fix pre-commit? It already gives you a hint to use contextlib.suppress.
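
For anyone unfamiliar with that linter hint: contextlib.suppress is the standard-library replacement for a try/except-pass block. A minimal illustration (the file name is only an example):

```python
import contextlib
import os

# Instead of:
#     try:
#         os.remove("checkpoint.tmp")
#     except FileNotFoundError:
#         pass
# the pre-commit linter suggests the equivalent:
with contextlib.suppress(FileNotFoundError):
    os.remove("checkpoint.tmp")
```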

@KAVYANSHTYAGI (Contributor, Author) left a comment

Looks like Probot / required-jobs failed due to a permissions issue (403 Forbidden when trying to comment) and a 60-minute timeout.

Copilot suggests updating .github/workflows/probot-check-group.yml with:

permissions:
  contents: write
  pull-requests: write
timeout-minutes: 90

Can you help update this so future PRs aren’t blocked? Let me know what you think, or whether there is another way I can handle this.

@Borda (Member) commented May 23, 2025

Copilot suggests updating .github/workflows/probot-check-group.yml with:

I see many failing checks, not just that one, so I would not bother with it until the real testing is resolved :)

@deependujha (Collaborator) commented May 23, 2025

@KAVYANSHTYAGI @Borda I guess the failing jobs yesterday or the day before were due to GitHub Actions being down. Try either rerunning the job or making some changes and pushing.

[screenshot attached: 2025-05-22, 4:42 PM]

@Borda (Member) commented May 27, 2025

Restarted all CI.

@KAVYANSHTYAGI (Contributor, Author) left a comment

Thankfully, all the test cases have passed; please review. Thank you for your support.

@Borda (Member) commented May 28, 2025

Thankfully, all the test cases have passed; please review. Thank you for your support.

Looks great, thank you

@Borda Borda merged commit 989b759 into Lightning-AI:master May 28, 2025
84 checks passed
Borda added a commit that referenced this pull request Jun 19, 2025
* Update signal_connector.py
* Update training_epoch_loop.py
* Create test_ddp_sigterm_handling.py
* update + chlog
* Apply suggestions from code review

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka B <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
(cherry picked from commit 989b759)
sudiptob2 pushed a commit to sudiptob2/pytorch-lightning that referenced this pull request Jun 27, 2025
…ing-AI#20825)

* Update signal_connector.py
* Update training_epoch_loop.py
* Create test_ddp_sigterm_handling.py
* update + chlog
* Apply suggestions from code review

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka B <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Labels: pl (Generic label for PyTorch Lightning package)
Projects: None yet
Development: Successfully merging this pull request may close these issues:

  • SIGTERMException is not raised consistently across all ranks in DDP

4 participants