🐛 Bug && steps to reproduce
I'm running the standard `boring_model` (or any model), with some minor changes to the arguments:
```python
trainer = Trainer(
    default_root_dir=os.getcwd(),
    limit_train_batches=1000,
    limit_val_batches=1,
    limit_test_batches=1,
    num_sanity_val_steps=0,
    max_epochs=100,
    enable_model_summary=False,
    accelerator='gpu',
    gpus=-1,
    strategy='ddp',
)
```
on our SLURM cluster like so:
```bash
srun -p gpu --cpus-per-task=18 --exclusive --mem-per-gpu=120000M --nodes=1 --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest python boring_model.py --accelerator 'gpu' --devices -1 --strategy ddp
```
However, it seems to spawn 8 processes: 4 'worlds' of size 2. While running, each GPU is running two processes. I also don't see "Multiprocessing is handled by SLURM" in the output, which I would expect based on here.
Expected behavior
I expect `srun` to launch 4 tasks, and PyTorch Lightning's `SLURMEnvironment` to turn these into a single world of size 4.
Environment
* CUDA:
- GPU:
- NVIDIA A100-SXM4-40GB
- NVIDIA A100-SXM4-40GB
- NVIDIA A100-SXM4-40GB
- NVIDIA A100-SXM4-40GB
- available: True
- version: 11.5
* Packages:
- numpy: 1.20.3
- pyTorch_debug: False
- pyTorch_version: 1.11.0+cu115
- pytorch-lightning: 1.6.3
- tqdm: 4.64.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.9.5
- version: #1 SMP Wed Apr 6 13:48:37 EDT 2022
Additional context
The issue first popped up in real-world code where I was combining PyTorch Lightning with Hydra, and showed up as a rather strange error message from Hydra. This thread contains a lot more analysis on that. However, I'll repeat/summarize the essentials here.
The problem arises because `_is_slurm_managing_tasks(self)` in `accelerator_connector.py` is returning `False`, despite the tasks being launched by SLURM. The result is that PyTorch Lightning will itself try to spawn tasks from each of the four processes already launched by `srun`. As we'll see later, each process sees 2 GPUs and thus launches two tasks, so I end up with my final total of 8 tasks.
Since PyTorch Lightning thinks it's running outside of a SLURM context, that also causes the Hydra error. This is because Lightning will use `subprocess_script.py` to try and launch a new process. Here it will add `hydra.run.dir` to the original argument list and call the original command again, but since the original command is the `_submit.py` script (and not the PyTorch Lightning training script), this fails with an unrecognized-argument error.
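To sketch that mechanism (heavily simplified from memory, not the actual `subprocess_script.py` code): for every extra local rank it thinks it needs, Lightning re-invokes the command it was originally started with, and appends Hydra overrides such as `hydra.run.dir` when Hydra is active:

```python
import os
import subprocess
import sys

# Heavily simplified sketch of the re-launch logic (not verbatim Lightning code).
num_devices = 2  # what auto_device_count() reported to this process (see below)
for local_rank in range(1, num_devices):
    env = {**os.environ, "LOCAL_RANK": str(local_rank)}
    command = [sys.executable] + sys.argv          # re-run whatever started rank 0
    command += [f"hydra.run.dir={os.getcwd()}"]    # appended when Hydra is detected
    subprocess.Popen(command, env=env)
# If sys.argv points at a submit wrapper instead of the training script, the
# appended override becomes an unrecognized argument and the child process errors out.
```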
That part aside, the real problem is: why doesn't Lightning recognize that SLURM is managing the tasks? Well, that's because of this line. Lightning assumes two things here:
1. That `self._parallel_devices` contains the number of physical parallel devices available per node.
2. That each node has the same number of parallel devices.
I won't go deeper into assumption (2), since it isn't broken here - I just want to remark that the `_is_slurm_managing_tasks()` logic would also break in that case. In our case, the problem is (1).
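For reference, here is roughly what that check amounts to in 1.6.x, paraphrased from memory rather than quoted verbatim; `parallel_devices` and `num_nodes` stand in for the connector's internal `self._parallel_devices` and `self._num_nodes_flag`:

```python
import os

# For this run, the connector ends up with 2 devices per process and 1 node.
parallel_devices = ["cuda:0", "cuda:1"]
num_nodes = 1

total_requested_devices = len(parallel_devices) * num_nodes  # 2
num_slurm_tasks = int(os.environ.get("SLURM_NTASKS", "0"))   # 4 under the srun above
print(num_slurm_tasks == total_requested_devices)            # False -> "not SLURM managed"
```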
Let's see what's going on step by step. `self._parallel_devices` is set here, based on a call to `self.accelerator.get_parallel_devices(self._devices_flag)`. In turn, `self._devices_flag` is set here in my case, based on a call to `self.accelerator.auto_device_count()`. Since my accelerator is a GPU, `self.accelerator` is a `GPUAccelerator` object, so it's calling this function. And this, surprisingly, is where the problem lies: `auto_device_count` will return the number of devices that that particular process has access to. That is potentially not the same as the number of physical parallel devices available per node in a SLURM allocation.
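As far as I can tell, `auto_device_count` for the GPU accelerator effectively boils down to `torch.cuda.device_count()`, which is a per-process count and respects whatever `CUDA_VISIBLE_DEVICES` the task was given:

```python
import torch

# Run inside one of the four srun tasks below, this prints 2, not 4:
# torch.cuda.device_count() only counts devices visible to *this* process,
# i.e. it respects the CUDA_VISIBLE_DEVICES set by --gpu-bind=closest.
print(torch.cuda.device_count())
```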
To demonstrate, consider the following run:
```bash
srun -p gpu --cpus-per-task=18 --exclusive --mem-per-gpu=120000M --nodes=1 --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest env | grep CUDA_VISIBLE_DEVICES
srun: job 1225150 queued and waiting for resources
srun: job 1225150 has been allocated resources
CUDA_VISIBLE_DEVICES=2,3
CUDA_VISIBLE_DEVICES=2,3
CUDA_VISIBLE_DEVICES=0,1
CUDA_VISIBLE_DEVICES=0,1
```
As you can see, SLURM launches 4 tasks. Each of those tasks gets access to a subset of the devices available on the node. In this case, `auto_device_count` will return 2 - the number of GPUs that that particular process has access to. The reason it only has access to two is the `--gpu-bind=closest` argument (see the documentation of `srun`). It makes sure that each task is bound to the GPUs that are closest (in a NUMA sense) to the CPUs controlling it. To demonstrate:
```bash
srun -p gpu --cpus-per-task=18 --exclusive --mem-per-gpu=120000M --nodes=1 --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest numactl --show
srun: job 1225362 queued and waiting for resources
srun: job 1225362 has been allocated resources
policy: default
preferred node: current
physcpubind: 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
cpubind: 1
nodebind: 1
membind: 0 1
policy: default
preferred node: current
physcpubind: 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
cpubind: 1
nodebind: 1
membind: 0 1
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
cpubind: 0
nodebind: 0
membind: 0 1
policy: default
preferred node: current
physcpubind: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
cpubind: 0
nodebind: 0
membind: 0 1
```
As you can see, each task gets bound to 18 cores. This system has 36 cores per socket and 2 sockets in total. SLURM binds task 0 to CPUs 0-17 on socket 0 and sets `CUDA_VISIBLE_DEVICES=0,1` for that task, because those two GPUs are attached to the PCI bridge of that socket. Similarly, it binds task 1 to CPUs 18-35 on socket 0 and also sets `CUDA_VISIBLE_DEVICES=0,1` (those CPUs are still on socket 0, so they are attached to the same PCI bridge). Then it binds task 2 to CPUs 36-53 on socket 1 and sets `CUDA_VISIBLE_DEVICES=2,3`, and it binds task 3 to CPUs 54-71 on socket 1 and sets `CUDA_VISIBLE_DEVICES=2,3`. Please note that all of this is completely intended and correct SLURM behaviour. Thus, ideally, the `SLURMEnvironment` should be able to deal with it - yet it doesn't.
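For what it's worth, SLURM already hands each task everything needed to form a single world of size 4. A quick check of the environment variables that `SLURMEnvironment` reads (the commented values are what I'd expect from the `srun` command above) illustrates this:

```python
import os

# What SLURM exports inside each of the four tasks from the srun above:
for var in ("SLURM_NTASKS", "SLURM_NNODES", "SLURM_PROCID", "SLURM_LOCALID"):
    print(var, os.environ.get(var))
# SLURM_NTASKS  -> "4"       world size
# SLURM_NNODES  -> "1"       number of nodes
# SLURM_PROCID  -> "0".."3"  global rank
# SLURM_LOCALID -> "0".."3"  local rank on the node
```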
I have tried simply hard-coding the output of `_is_slurm_managing_tasks` to `True`, just to see if that would fix it. However, this only produces more errors: the `SLURMEnvironment` is now being used, but it results in
```
Traceback (most recent call last):
  File "/gpfs/home4/casparl/2D-VQ-AE-2/boring_model.py", line 64, in <module>
    run()
  File "/gpfs/home4/casparl/2D-VQ-AE-2/boring_model.py", line 61, in run
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/trainer/trainer.py", line 736, in _call_and_handle_interrupt
    self._teardown()
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/trainer/trainer.py", line 1298, in _teardown
    self.strategy.teardown()
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/strategies/ddp.py", line 471, in teardown
    if self.root_device.type == "cuda":
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/strategies/ddp.py", line 117, in root_device
    return self.parallel_devices[self.local_rank]
IndexError: list index out of range
```
probably because `self._devices_flag` is set to `[0,1]` for each task, even though two out of the four tasks only have access to devices `[2,3]`.
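A minimal illustration of why `self.parallel_devices[self.local_rank]` blows up under that hard-coded workaround (hypothetical values matching the run above, not Lightning code):

```python
# Hypothetical reconstruction of the state inside task 3 (not Lightning code):
parallel_devices = ["cuda:0", "cuda:1"]     # built from the 2 devices this task sees
local_rank = 3                              # SLURM_LOCALID of the 4th task on the node
root_device = parallel_devices[local_rank]  # IndexError: list index out of range
```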
I'm not really sure how to 'fix' this: it would require changes both in the `_is_slurm_managing_tasks` function and in the way that `SLURMEnvironment` assigns devices to each of the processes...
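As a very rough sketch of one direction (untested, and the names below are mine, not Lightning's): the detection could rely purely on SLURM's own task bookkeeping, independent of how many GPUs each individual task happens to see:

```python
import os

def slurm_is_managing_tasks(num_nodes: int) -> bool:
    """Untested sketch: decide based on SLURM's task counts only."""
    if "SLURM_NTASKS" not in os.environ:
        return False
    num_slurm_tasks = int(os.environ["SLURM_NTASKS"])
    # SLURM_NTASKS_PER_NODE may look like "4" or "4(x2)"; take the leading count.
    tasks_per_node = int(os.environ.get("SLURM_NTASKS_PER_NODE", "1").split("(")[0])
    return num_slurm_tasks == tasks_per_node * num_nodes
```

Even with something along these lines, the device-assignment side would still need to map each task's `SLURM_LOCALID` onto whatever `CUDA_VISIBLE_DEVICES` that task actually received, so the `SLURMEnvironment` part needs attention as well.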