Description
Bug description
PyTorch Lightning does not work properly when another tool handles the SLURM scheduling. In particular, all my jobs receive a large number of SIGTERM signals when using submitit.
This and similar issues seem to have been raised many times but never solved (maybe due to lack of reproducible code), see: #5969 #5225 (maybe #10154)
How to reproduce the bug
I made a minimal reproducible repo for the bug here. Please see the README there. Needless to say, you need SLURM, and hopefully the error does not depend on the SLURM config.
The code only consists of scheduling a model on SLURM and checking the logs. The main code (main.py) runs a logistic regression. The rest is simply the SLURM config (config/sigterm.py), where you should change the partition to match your SLURM setup.
Running python main.py -m schedules the job on SLURM and prints the logging directory (e.g. multirun/2022-09-25/20-28-21/). If you open the log file (e.g. less multirun/2022-09-25/20-28-21/0/main.log), you should see all the SIGTERM signals, logged as "Bypassing signal SIGTERM".
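To make the check on the log file less manual, here is a small sketch that counts how often the message appears. It assumes the log contains the literal text "Bypassing signal SIGTERM" quoted above (matched case-insensitively in case the wording differs slightly between versions); the function names are just illustrative and are not part of the repro repo.

```python
from pathlib import Path

# Message quoted in the report; the exact wording may vary by version (assumption).
MARKER = "Bypassing signal SIGTERM"


def count_sigterm_lines(log_text: str, marker: str = MARKER) -> int:
    """Return the number of lines in log_text that contain the marker."""
    needle = marker.lower()
    return sum(1 for line in log_text.splitlines() if needle in line.lower())


def count_sigterm_in_file(path: str) -> int:
    """Read a log file and count the lines that contain the marker."""
    text = Path(path).read_text(errors="replace")
    return count_sigterm_lines(text)
```

For example, calling count_sigterm_in_file on the main.log of a run gives a single number that can be compared across Lightning versions or SLURM configs.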
Error messages and logs
Important info
Please see the requirements.txt. The Lightning version is 1.7.7, but I have had these SIGTERMs since at least version 1.5.
More info
More generally, there should be an easy way to completely deactivate the SLURM integration in PyTorch Lightning. This has already caused many issues (e.g. #6389 #3651) and will probably continue to do so. The thread in #6389 shows that there is a lot of interest in being able to deactivate it (as suggested by @Queuecumber @carmocca), and it seems very cheap to do.
In my case, I often need 2 PyTorch Lightning models in a single script (self-supervised learning + linear probing), so I want to manage SLURM myself for multiple Lightning trainers and don't want Lightning to do it for me (there are other reasons too, but this is the most prominent).
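To illustrate the kind of opt-out I mean, here is a rough stopgap sketch that temporarily hides the SLURM_* environment variables from whatever runs inside the block. It rests on assumptions I have not verified against Lightning 1.7.7: that the SLURM detection depends only on these environment variables, and that it is enough to hide them while the trainer is constructed. The name without_slurm_env is made up for this example. Note that other tools (submitit included) may also read these variables, so this is not a real fix.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def without_slurm_env(prefix: str = "SLURM_") -> Iterator[None]:
    """Temporarily remove environment variables that start with prefix.

    Hypothetical helper: assumes SLURM detection only looks at os.environ.
    """
    # Remember the current values so they can be restored afterwards.
    saved = {k: v for k, v in os.environ.items() if k.startswith(prefix)}
    for key in saved:
        del os.environ[key]
    try:
        yield
    finally:
        # Restore the original values, even if the body raised.
        os.environ.update(saved)


# Intended usage (the trainer construction is only indicated, not run here):
# with without_slurm_env():
#     trainer = ...  # build the Lightning trainer while SLURM_* is hidden
```

A proper switch in Lightning itself would still be preferable, since this hides the variables from everything in the block rather than from Lightning only.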
Tagging people who seem to have thoughts and knowledge about all of this: @awaelchli