
Conversation

@KAVYANSHTYAGI (Contributor) commented May 13, 2025

What does this PR do?

This PR resolves a critical issue in PyTorch Lightning's Distributed Data Parallel (DDP) training where a SIGTERM signal received by one rank (typically rank 0) can cause deadlocks due to unsynchronized termination behavior across ranks.

Problem Description

In DDP training, PyTorch requires all processes to synchronize at every training step. If one rank exits early (e.g., after receiving a SIGTERM signal) while the others continue to the next training step, the remaining ranks hang indefinitely, waiting at collectives such as all_gather or reduce.

This issue becomes highly problematic in production environments, especially in distributed setups orchestrated by Kubernetes or SLURM, where preemptions or resource eviction events are frequent.

Specifically, the flow causing the bug:

  1. Rank 0 receives SIGTERM and raises SIGTERMException after on_advance_end.
  2. Other ranks do not receive this signal and proceed into the next batch.
  3. These ranks block at the beginning of the next batch, waiting for rank 0 — which has already exited.
  4. This results in a distributed deadlock.

Fix

This PR introduces a synchronized SIGTERM handling mechanism using torch.distributed.broadcast():

  • Upon receiving a SIGTERM, rank 0 sets a sigterm_tensor = 1 and broadcasts it to all other ranks.
  • At the start of each training batch, every rank checks this broadcasted signal.
  • If the signal indicates a SIGTERM was received:
    • All ranks call torch.distributed.barrier() to synchronize safely.
    • Then, they raise SIGTERMException in unison, exiting the training loop cleanly.

This ensures no rank continues into the next batch while others have exited, thereby preventing deadlocks and enabling graceful termination and checkpointing.
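
For readers skimming the diff, a minimal, self-contained sketch of this pattern is shown below. It assumes a CPU/gloo process group, and the handler and helper names (`_handle_sigterm`, `check_sigterm_across_ranks`) are illustrative only, not the exact code in signal_connector.py or training_epoch_loop.py:

```python
import signal

import torch
import torch.distributed as dist


class SIGTERMException(SystemExit):
    """Raised on every rank once any rank has observed a SIGTERM."""


_sigterm_received = False


def _handle_sigterm(signum, frame):
    # Only record the signal; the actual shutdown is deferred to the
    # next batch boundary so all ranks can act on it together.
    global _sigterm_received
    _sigterm_received = True


signal.signal(signal.SIGTERM, _handle_sigterm)


def check_sigterm_across_ranks() -> None:
    """Call on every rank at the start of each training batch."""
    # With the NCCL backend the flag tensor must live on the local GPU;
    # for gloo a CPU tensor is fine.
    sigterm_tensor = torch.tensor([1 if _sigterm_received else 0], dtype=torch.int)
    # Rank 0's value is authoritative: every rank learns whether rank 0 saw SIGTERM.
    dist.broadcast(sigterm_tensor, src=0)
    if sigterm_tensor.item() == 1:
        # Synchronize first so no rank is left blocking inside a collective,
        # then raise at the same point of the loop on every rank.
        dist.barrier()
        raise SIGTERMException()
```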

Test Included

  • test_ddp_sigterm_handling.py:
    • Simulates a multi-process DDP training setup.
    • Injects SIGTERM in rank 0 during batch 2.
    • Asserts that:
      • All ranks receive the SIGTERM via broadcast.
      • All ranks raise SIGTERMException gracefully.
      • No deadlock occurs.

This test is guarded with @pytest.mark.skipif to skip in single-device or non-distributed setups, ensuring safety across CI platforms.
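
For orientation, the overall shape of such a test could look roughly like the sketch below. This is not the actual test_ddp_sigterm_handling.py from the PR; the world size, port, and helper names are placeholder choices, and the gloo backend is used so it can run on CPU:

```python
import os
import signal

import pytest
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 2


def _worker(rank: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29511")
    dist.init_process_group("gloo", rank=rank, world_size=WORLD_SIZE)

    state = {"sigterm": False}
    signal.signal(signal.SIGTERM, lambda *_: state.update(sigterm=True))

    try:
        for batch_idx in range(5):
            if rank == 0 and batch_idx == 2:
                # Inject SIGTERM on rank 0 only, mid-training.
                os.kill(os.getpid(), signal.SIGTERM)
            # Broadcast rank 0's flag so every rank sees the same decision.
            flag = torch.tensor([int(state["sigterm"])])
            dist.broadcast(flag, src=0)
            if flag.item() == 1:
                dist.barrier()
                return  # all ranks leave the loop together: no deadlock
            # ... one training step would go here ...
    finally:
        dist.destroy_process_group()


@pytest.mark.skipif(not dist.is_available(), reason="requires torch.distributed")
def test_ddp_sigterm_exits_all_ranks_without_deadlock():
    # If any worker hangs or crashes, spawn(join=True) fails the test.
    mp.spawn(_worker, nprocs=WORLD_SIZE, join=True)
```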


Fixes #20806

Before submitting

  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (not necessary here)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request? (None)
  • Did you update the CHANGELOG? (Optional for internal changes)

Reviewer checklist

  • Is this pull request ready for review?
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Fun fact: This fix is critical for scaling PL in real-world clusters with robust SIGTERM checkpointing and graceful exits. Hope this PR saves someone’s multi-day training job from being lost.


📚 Documentation preview 📚: https://pytorch-lightning--20825.org.readthedocs.build/en/20825/

@github-actions bot added the pl (Generic label for PyTorch Lightning package) label May 13, 2025
@Borda (Member) left a comment

Overall looks good, just some tests on CPU seem not to be happy about it.

@KAVYANSHTYAGI (Contributor, Author) left a comment

This PR now adds synchronized SIGTERM handling for DDP training in PyTorch Lightning, which safely broadcasts termination across all ranks to prevent deadlocks and exits cleanly via SIGTERMException. All distributed logic is CI-safe, and DDP tests are conditionally skipped on unsupported environments. The patch is robust, minimal, and production-friendly.

@lantiga (Collaborator) left a comment

Great fix @KAVYANSHTYAGI

@KAVYANSHTYAGI (Contributor, Author) left a comment

This update refines SIGTERM handling in DDP training by ensuring that SIGTERMException is only raised after a successful broadcast and is no longer suppressed unintentionally. The fix moves the exception outside the try block, addressing concerns about it being silently ignored.
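
In other words, the raise now happens after the guarded block rather than inside it. A small sketch of the pattern (the function and class names here are illustrative, not the PR's exact code):

```python
class SIGTERMException(SystemExit):
    pass


def maybe_raise_sigterm(sigterm_seen: bool) -> None:
    # Before: `raise SIGTERMException()` sat inside the try block, so the
    # surrounding exception handling could swallow it silently.
    try:
        pass  # ... broadcast the SIGTERM flag to the other ranks here ...
    except RuntimeError:
        pass  # ignore a failed broadcast, e.g. if the process group is gone
    # After: the raise lives outside the try block and can no longer be suppressed.
    if sigterm_seen:
        raise SIGTERMException()
```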

@deependujha (Collaborator) left a comment

Hi @KAVYANSHTYAGI, can you please fix pre-commit? It already gives you a hint to use contextlib.suppress.
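
For anyone unfamiliar with that linter hint: contextlib.suppress is the standard-library replacement for a try/except-pass block. A minimal illustration (the file name is only an example):

```python
import contextlib
import os

# Instead of:
#     try:
#         os.remove("checkpoint.tmp")
#     except FileNotFoundError:
#         pass
# the pre-commit linter suggests the equivalent:
with contextlib.suppress(FileNotFoundError):
    os.remove("checkpoint.tmp")
```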

@KAVYANSHTYAGI (Contributor, Author) left a comment

Looks like Probot / required-jobs failed due to a permissions issue (403 Forbidden when trying to comment) and a 60-minute timeout.

Copilot suggests updating .github/workflows/probot-check-group.yml with:

permissions:
  contents: write
  pull-requests: write
timeout-minutes: 90

Can you help update this so future PRs aren’t blocked? Let me know what you think, or whether there is another way I can handle this.

@Borda (Member) commented May 23, 2025

Copilot suggests updating .github/workflows/probot-check-group.yml with:

I see many failing checks, not just that one, so I would not bother with it until the real testing is resolved :)

@deependujha (Collaborator) commented May 23, 2025

@KAVYANSHTYAGI @Borda I guess the failing jobs yesterday or the day before were due to GitHub Actions being down. Try either rerunning the job or making some changes and pushing.

[screenshot attached: 2025-05-22, 4:42 PM]

@Borda (Member) commented May 27, 2025

Restarted all CI.

@KAVYANSHTYAGI (Contributor, Author) left a comment

Thankfully, all the test cases have passed; please review. Thank you for your support.

@Borda (Member) commented May 28, 2025

Thankfully, all the test cases have passed; please review. Thank you for your support.

Looks great, thank you

@Borda Borda merged commit 989b759 into Lightning-AI:master May 28, 2025
84 checks passed
Borda added a commit that referenced this pull request Jun 19, 2025
* Update signal_connector.py
* Update training_epoch_loop.py
* Create test_ddp_sigterm_handling.py
* update + chlog
* Apply suggestions from code review

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka B <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
(cherry picked from commit 989b759)
sudiptob2 pushed a commit to sudiptob2/pytorch-lightning that referenced this pull request Jun 27, 2025
…ing-AI#20825)

* Update signal_connector.py
* Update training_epoch_loop.py
* Create test_ddp_sigterm_handling.py
* update + chlog
* Apply suggestions from code review

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka B <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Labels: pl (Generic label for PyTorch Lightning package)
Projects: None yet
Development: Successfully merging this pull request may close these issues:

  • SIGTERMException is not raised consistently across all ranks in DDP

4 participants