Lightning sends SIGTERM when using other SLURM manager #14893

@YannDubs

Bug description

PyTorch Lightning does not work when another tool handles SLURM scheduling. In particular, all my jobs receive many SIGTERM signals when using submitit.

This and similar issues seem to have been raised many times but never resolved (perhaps for lack of reproducible code); see #5969, #5225, and possibly #10154.

How to reproduce the bug

I made a minimal reproducible repo for the bug here. Please see the README there. You will need access to SLURM, and hopefully the error does not depend on the SLURM configuration.

The code only schedules a model on SLURM so that the logs can be checked. The main code (main.py) runs a logistic regression.

The rest is simply the SLURM config (config/sigterm.py), where you should change the partition to match your SLURM cluster.

Running python main.py -m schedules the job on SLURM and prints the logging directory (e.g. multirun/2022-09-25/20-28-21/). If you open the log file (e.g. less multirun/2022-09-25/20-28-21/0/main.log), you should see many lines reading "Bypassing signal SIGTERM".
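
To quantify the problem instead of paging through the log by hand, the matching lines can be counted. This is a minimal sketch, not part of the reproduction repo: the function name count_sigterm_lines is hypothetical, and the only thing it assumes about the log is that each affected line contains the message text quoted above.

```python
from pathlib import Path

# Message text as quoted in this report; adjust if your log wording differs.
SIGTERM_MARKER = "Bypassing signal SIGTERM"


def count_sigterm_lines(log_path, marker=SIGTERM_MARKER):
    """Return the number of lines in the log file that contain the marker."""
    # errors="replace" keeps the scan going if the log has stray bytes.
    with Path(log_path).open(encoding="utf-8", errors="replace") as f:
        return sum(1 for line in f if marker in line)
```

Calling count_sigterm_lines on a log path such as the main.log example above returns the number of lines containing the message, which makes it easier to compare runs across Lightning versions or SLURM settings.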

Error messages and logs

[Image: Screenshot 2022-09-25 at 20 53 42]

Important info

Please see the requirements.txt. The Lightning version is 1.7.7, but I have been seeing these SIGTERMs since at least version 1.5.

More info

More generally, there should be an easy way to completely deactivate the SLURM integration in PyTorch Lightning. It has already caused many issues (e.g. #6389, #3651) and will probably continue to do so. The thread in #6389 shows that there is a lot of interest in being able to deactivate it (as suggested by @Queuecumber and @carmocca), and it seems very cheap to do.

In my case, I often need two PyTorch Lightning models in a single script (self-supervised learning plus linear probing), so I want to manage SLURM myself across multiple Lightning trainers and don't want Lightning to do it for me. There are other reasons as well, but this is the most prominent.

Tagging people that seem to have thoughts and knowledge about all of that: @awaelchli
