Make the Slurm Signal Configurable #14610

@Queuecumber

Description

🚀 Feature

Lightning currently assumes that the Slurm preemption signal is USR1. It would be nice to add an option to SLURMEnvironment to change this.

Motivation

I use submitit to handle Slurm submissions, and it recently moved away from USR1 because of an NCCL conflict.

This breaks requeuing in PyTorch Lightning.

See also facebookincubator/submitit#1709

Pitch

SLURMEnvironment can be trivially extended to account for this with something like:

def __init__(self, auto_requeue: bool = True, signal: str = "USR1") -> None:
    super().__init__()
    self.auto_requeue = auto_requeue
    self.signal = signal

Then SignalConnector (around line 60) can be changed to something like:

sig = getattr(signal, "SIG" + environment.signal, None)
self._register_signal(sig, HandlersCompose(sigusr1_handlers))
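
To make the idea concrete, here is a minimal standalone sketch of the name-to-signal lookup using only the standard library. It is not Lightning's actual code: `SlurmEnvSketch`, `resolve_signal`, `register_requeue_handler`, and the `requeue_signal` parameter name are all hypothetical and chosen for illustration (the parameter is renamed here only so that it does not shadow the `signal` module). Unlike the `getattr(..., None)` version above, this sketch raises on an unknown name instead of passing `None` on to the registration step.

```python
import signal


class SlurmEnvSketch:
    """Hypothetical stand-in for SLURMEnvironment; not Lightning's actual class."""

    def __init__(self, auto_requeue: bool = True, requeue_signal: str = "USR1") -> None:
        self.auto_requeue = auto_requeue
        # Short signal name without the "SIG" prefix, e.g. "USR1" or "USR2".
        self.requeue_signal = requeue_signal


def resolve_signal(name: str) -> signal.Signals:
    """Map a name such as "USR2" or "SIGUSR2" to a signal.Signals member."""
    full_name = name if name.startswith("SIG") else "SIG" + name
    try:
        return signal.Signals[full_name]
    except KeyError:
        # Fail loudly rather than registering a handler on None.
        raise ValueError(f"Unknown signal name: {name!r}") from None


def register_requeue_handler(env: SlurmEnvSketch, handler) -> signal.Signals:
    """Register `handler` for the signal configured on `env` (main thread only)."""
    sig = resolve_signal(env.requeue_signal)
    signal.signal(sig, handler)
    return sig
```

With something along these lines, a user whose scheduler sends USR2 could construct the environment with `requeue_signal="USR2"`, and the handler would be attached to `signal.SIGUSR2` on platforms that define it.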

cc @Borda @awaelchli
