Closed
Labels: community, environment: slurm, feature
Description
🚀 Feature
Lightning currently assumes that the SLURM preemption signal is USR1. It would be nice to add an option to SLURMEnvironment to change this.
Motivation
I use submitit to handle SLURM submissions, and it recently moved away from USR1 because of an NCCL conflict.
This breaks requeuing in PyTorch Lightning.
See also facebookincubator/submitit#1709
Pitch
SLURMEnvironment can be trivially extended to account for this with something like:
def __init__(self, auto_requeue: bool = True, signal: str = "USR1") -> None:
    super().__init__()
    self.auto_requeue = auto_requeue
    self.signal = signal
Then SignalConnector (around line 60) can be changed to something like:
sig = getattr(signal, "SIG" + environment.signal, None)
self._register_signal(sig, HandlersCompose(sigusr1_handlers))
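To make the idea concrete outside of Lightning, here is a minimal standalone sketch using only the standard library. The names `resolve_signal`, `RequeueSignalConfig`, `requeue_signal`, and `register_requeue_handler` are hypothetical and are not part of Lightning's API. The sketch also assumes that an unknown signal name should raise early rather than silently resolve to `None`, as the `getattr(..., None)` call above would.

```python
import signal
from types import FrameType
from typing import Callable, Optional


def resolve_signal(name: str) -> signal.Signals:
    """Map a name such as "USR1" or "SIGUSR2" to a signal.Signals member."""
    normalized = name.upper()
    if not normalized.startswith("SIG"):
        normalized = "SIG" + normalized
    sig = getattr(signal, normalized, None)
    # Reject missing names and non-signal attributes such as SIG_DFL.
    if not isinstance(sig, signal.Signals):
        raise ValueError(f"Unknown signal name: {name!r}")
    return sig


class RequeueSignalConfig:
    """Hypothetical stand-in for the proposed SLURMEnvironment option."""

    def __init__(self, auto_requeue: bool = True, requeue_signal: str = "USR1") -> None:
        self.auto_requeue = auto_requeue
        # Validate eagerly so a typo fails at construction time.
        self.requeue_signal = resolve_signal(requeue_signal)


def register_requeue_handler(
    config: RequeueSignalConfig,
    handler: Callable[[int, Optional[FrameType]], None],
) -> Optional[signal.Signals]:
    """Install the handler for the configured signal, if requeuing is enabled."""
    if not config.auto_requeue:
        return None
    # signal.signal must be called from the main thread.
    signal.signal(config.requeue_signal, handler)
    return config.requeue_signal
```

With this shape, a submitit user could pass `requeue_signal="USR2"` while everyone else keeps the current default. Naming the parameter something other than `signal` also avoids shadowing the `signal` module inside `__init__`.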
cc @Borda @awaelchli