Fix: Synchronize SIGTERM Handling in DDP to Prevent Deadlocks #20825
Conversation
Overall looks good, just some tests on CPU seem not to be happy about it.
This PR now adds synchronized SIGTERM handling for DDP training in PyTorch Lightning, which safely broadcasts termination across all ranks to prevent deadlocks and exits cleanly via SIGTERMException. All distributed logic is CI-safe, and DDP tests are conditionally skipped on unsupported environments. The patch is now robust, minimal, and production-friendly.
Great fix @KAVYANSHTYAGI
This update refines SIGTERM handling in DDP training by ensuring that SIGTERMException is only raised after a successful broadcast and is no longer suppressed unintentionally. The fix moves the exception outside the try block, addressing concerns about it being silently ignored.
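A minimal sketch of the pattern being described, for illustration only (these helper names are made up, not the actual Lightning internals, and a plain RuntimeError stands in for Lightning's SIGTERMException): the broadcast stays inside the try block, while the exception that stops training is raised afterwards, outside it, so the error handling around the collective can no longer swallow it.

```python
import torch
import torch.distributed as dist


def maybe_raise_sigterm(received_sigterm: bool) -> None:
    should_stop = received_sigterm
    if dist.is_available() and dist.is_initialized():
        try:
            # Rank 0 contributes its local signal state; the other ranks
            # receive it through the broadcast.
            flag = torch.tensor([1 if (dist.get_rank() == 0 and received_sigterm) else 0])
            dist.broadcast(flag, src=0)
            should_stop = bool(flag.item())
        except RuntimeError:
            # A failed collective is handled here; it can no longer hide the
            # exception below, because that one is raised outside this block.
            pass
    if should_stop:
        # In the actual patch this is Lightning's SIGTERMException.
        raise RuntimeError("SIGTERM received")
```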
Hi @KAVYANSHTYAGI, can you please fix pre-commit? It already gives you a hint to use |
Looks like Probot / required-jobs failed due to a permissions issue (403 Forbidden when trying to comment) and a 60-minute timeout.
Copilot suggests updating .github/workflows/probot-check-group.yml with:
```yaml
permissions:
  contents: write
  pull-requests: write
timeout-minutes: 90
```
Can you help update this so future PRs aren’t blocked?
Do let me know what you think, or whether there is any other way I can handle this.
I see many failing checks, not just that one, so I would not bother with it until the real testing is resolved :)
@KAVYANSHTYAGI @Borda I guess the failing jobs yesterday or the day before were due to GitHub Actions being down. Try either rerunning the jobs or making some changes and pushing.
restarted all CI
Thankfully, all the test cases have passed; please review it. Thank you for your support.
Looks great, thank you
* Update signal_connector.py
* Update training_epoch_loop.py
* Create test_ddp_sigterm_handling.py
* update + chlog
* Apply suggestions from code review

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka B <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>

(cherry picked from commit 989b759)
What does this PR do?
This PR resolves a critical issue in PyTorch Lightning's Distributed Data Parallel (DDP) training where a SIGTERM signal received by one rank (typically rank 0) can cause deadlocks due to unsynchronized termination behavior across ranks.
Problem Description
In DDP training, PyTorch requires all processes to synchronize at every training step. If one rank exits early (e.g., due to receiving a SIGTERM signal) while others continue to the next training step, they hang indefinitely waiting at synchronization barriers such as `all_gather` or `reduce`.

This issue becomes highly problematic in production environments, especially in distributed setups orchestrated by Kubernetes or SLURM, where preemptions or resource eviction events are frequent.
Specifically, the flow causing the bug:

- One rank (typically rank 0) receives SIGTERM and raises `SIGTERMException` after `on_advance_end`.
- The remaining ranks are unaware of the signal, continue into the next training step, and block at the next collective call.

Fix

This PR introduces a synchronized SIGTERM handling mechanism using `torch.distributed.broadcast()`:

- The rank that receives SIGTERM sets `sigterm_tensor = 1` and broadcasts it to all other ranks.
- All ranks call `torch.distributed.barrier()` to synchronize safely.
- All ranks then raise `SIGTERMException` in unison, exiting the training loop cleanly.

This ensures no rank continues into the next batch while others have exited, thereby preventing deadlocks and enabling graceful termination and checkpointing. A minimal sketch of the broadcast pattern is included at the end of this description.

Test Included

Added `test_ddp_sigterm_handling.py`, which:

- simulates SIGTERM in rank 0 during batch 2, and
- verifies that all ranks raise `SIGTERMException` and exit gracefully.

This test is guarded with `@pytest.mark.skipif` to skip in single-device or non-distributed setups, ensuring safety across CI platforms. A simplified test sketch is also included at the end of this description.

Fixes #20806
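To make the "no rank continues into the next batch" property concrete, here is a minimal, illustrative sketch of the per-step synchronization idea referenced above. It is not the actual Lightning implementation: the function names and the plain RuntimeError are placeholders (the real patch raises `SIGTERMException`), and an initialized torch.distributed process group is assumed.

```python
import torch
import torch.distributed as dist


def run_training(dataloader, training_step, sigterm_received) -> None:
    """Illustrative loop: `sigterm_received()` reports this rank's local signal state."""
    for batch_idx, batch in enumerate(dataloader):
        training_step(batch)

        # Every rank takes part in the broadcast on every step, so the check is
        # itself a synchronization point: either all ranks see 0 and move on to
        # the next batch, or all ranks see 1 and stop here together.
        sigterm_tensor = torch.tensor(
            [1 if (dist.get_rank() == 0 and sigterm_received()) else 0]
        )
        dist.broadcast(sigterm_tensor, src=0)
        dist.barrier()

        if sigterm_tensor.item() == 1:
            # The actual patch raises Lightning's SIGTERMException so the
            # Trainer can checkpoint and shut down gracefully.
            raise RuntimeError(f"SIGTERM received; stopping all ranks at batch {batch_idx}")
```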
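Similarly, a simplified, hedged sketch of what such a test can look like. This is not the PR's actual test (which exercises the Lightning Trainer under a DDP strategy); it only checks the synchronization behaviour on two gloo/CPU workers, with made-up names and an arbitrarily chosen port.

```python
import os

import pytest
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int) -> None:
    # Minimal CPU process-group setup; the port is arbitrary.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29511")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    try:
        stopped_at = None
        for batch_idx in range(5):
            # Rank 0 "receives" SIGTERM at batch 2; the other rank only learns
            # about it through the broadcast.
            local_flag = 1 if (rank == 0 and batch_idx == 2) else 0
            flag = torch.tensor([local_flag])
            dist.broadcast(flag, src=0)
            if flag.item() == 1:
                stopped_at = batch_idx
                break
        # Every rank must stop at the same batch, i.e. nobody runs ahead into a
        # collective call that the other rank will never reach.
        assert stopped_at == 2
    finally:
        dist.destroy_process_group()


@pytest.mark.skipif(not dist.is_available(), reason="requires torch.distributed")
def test_sigterm_flag_is_synchronized_across_ranks():
    mp.spawn(_worker, args=(2,), nprocs=2)
```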
Before submitting
Reviewer checklist
Fun fact: This fix is critical for scaling PL in real-world clusters with robust SIGTERM checkpointing and graceful exits. Hope this PR saves someone’s multi-day training job from being lost....
📚 Documentation preview 📚: https://pytorch-lightning--20825.org.readthedocs.build/en/20825/