
_is_slurm_managing_tasks(self) incorrectly returns False when using SLURM's --gpu-bind=closest #13605

Description

@casparvl

🐛 Bug && steps to reproduce

I'm running the standard boring_model (or any model), with some minor changes to the arguments:

    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1000,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=100,
        enable_model_summary=False,
        accelerator='gpu',
        gpus=-1,
        strategy='ddp',
    )

on our SLURM cluster like so:

srun -p gpu --cpus-per-task=18 --exclusive --mem-per-gpu=120000M --nodes=1 --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest python boring_model.py --accelerator 'gpu' --devices -1 --strategy ddp

However, it seems to spawn 8 processes: 4 'worlds' of size 2. While running, each GPU is running two processes. I also don't see "Multiprocessing is handled by SLURM" in the output, which would be expected from here.

Expected behavior

I expect srun to launch 4 tasks, and PyTorch Lightning's SLURMEnvironment to turn these into a single world of size 4.

Environment

* CUDA:
        - GPU:
                - NVIDIA A100-SXM4-40GB
                - NVIDIA A100-SXM4-40GB
                - NVIDIA A100-SXM4-40GB
                - NVIDIA A100-SXM4-40GB
        - available:         True
        - version:           11.5
* Packages:
        - numpy:             1.20.3
        - pyTorch_debug:     False
        - pyTorch_version:   1.11.0+cu115
        - pytorch-lightning: 1.6.3
        - tqdm:              4.64.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.9.5
        - version:           #1 SMP Wed Apr 6 13:48:37 EDT 2022

Additional context

The issue first popped up in real-world code where I was combining PyTorch Lightning with Hydra, and showed up as a rather strange error message from Hydra. This thread contains a lot more analysis on that. However, I'll repeat/summarize the essentials here.

The problem arises because _is_slurm_managing_tasks(self) in accelerator_connector.py returns False, despite the tasks being launched by SLURM. As a result, PyTorch Lightning tries to spawn tasks itself, from each of the four processes already launched by srun. As we'll see later, each process sees 2 GPUs and therefore launches two tasks, which is how I end up with a total of 8 tasks.

Since PyTorch Lightning thinks it's running outside of a SLURM context, that also causes the Hydra error. Lightning uses subprocess_script.py to try and launch a new process. There it adds hydra.run.dir to the original argument list and calls the original command again, but since the original command is the _submit.py script (and not the PyTorch Lightning training script), this fails with an unrecognized-argument error.

That part aside, the real problem is: why doesn't Lightning recognize that SLURM is managing the tasks? That's because of this line. Lightning assumes two things here:

  1. That self._parallel_devices contains the number of physical parallel devices that are available per node.
  2. That each node has the same number of parallel devices.

I won't go deeper into (2), since that assumption isn't violated here; I just want to remark that the _is_slurm_managing_tasks() logic would also break in that case. In our case, the problem is (1).
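For reference, the check in question boils down to something like this (a simplified sketch of the 1.6.x logic, not an exact copy of the source):

    import os

    def is_slurm_managing_tasks(parallel_devices, num_nodes):
        """Simplified sketch of _is_slurm_managing_tasks (PL 1.6.x, paraphrased).

        It implicitly assumes len(parallel_devices) equals the number of
        physical GPUs per node, which --gpu-bind=closest breaks.
        """
        total_requested_devices = len(parallel_devices) * num_nodes
        num_slurm_tasks = int(os.environ.get("SLURM_NTASKS", 0))
        return num_slurm_tasks == total_requested_devices

    # In my job: SLURM_NTASKS == 4 and num_nodes == 1, but each task only sees
    # 2 GPUs, so total_requested_devices == 2 and the check returns False.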

Let's see what's going on step by step. self._parallel_devices is set here, based on a call to self.accelerator.get_parallel_devices(self._devices_flag). In turn, self._devices_flag is set here in my case, based on a call to self.accelerator.auto_device_count(). Since my accelerator is a GPU, self.accelerator is a GPUAccelerator object, so it calls this function. And this, surprisingly, is where the problem lies: auto_device_count returns the number of devices that that particular process has access to. That is potentially not the same as the number of physical parallel devices available per node in a SLURM allocation.
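To make that concrete, this is roughly what happens inside each srun task (a sketch; auto_device_count presumably boils down to torch.cuda.device_count(), which honours CUDA_VISIBLE_DEVICES):

    import torch

    # Under --gpu-bind=closest each task only sees 2 of the node's 4 GPUs,
    # because SLURM sets CUDA_VISIBLE_DEVICES per task (demonstrated below).
    print(torch.cuda.device_count())  # -> 2 inside each of the 4 tasks

    # self._parallel_devices therefore ends up with only two entries, e.g.
    # [torch.device("cuda", 0), torch.device("cuda", 1)],
    # so len(parallel_devices) * num_nodes == 2 != SLURM_NTASKS == 4.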

To demonstrate, consider the following run:

srun -p gpu --cpus-per-task=18 --exclusive --mem-per-gpu=120000M --nodes=1 --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest env | grep CUDA_VISIBLE_DEVICES
srun: job 1225150 queued and waiting for resources
srun: job 1225150 has been allocated resources
CUDA_VISIBLE_DEVICES=2,3
CUDA_VISIBLE_DEVICES=2,3
CUDA_VISIBLE_DEVICES=0,1
CUDA_VISIBLE_DEVICES=0,1

As you can see, SLURM launches 4 tasks. Each of those tasks gets access to a subset of the devices available on the node. In this case, auto_device_count will return 2: the number of GPUs that that particular process has access to. The reason it only has access to two is the --gpu-bind=closest argument (see the documentation of srun). This ensures that each task is bound to the GPUs that are closest (in a NUMA sense) to the CPUs controlling it. To demonstrate:

srun -p gpu --cpus-per-task=18 --exclusive --mem-per-gpu=120000M --nodes=1 --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest numactl --show
srun: job 1225362 queued and waiting for resources
srun: job 1225362 has been allocated resources
policy: default
preferred node: current
physcpubind: 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
cpubind: 1
nodebind: 1
membind: 0 1
policy: default
preferred node: current
physcpubind: 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
cpubind: 1
nodebind: 1
membind: 0 1
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
cpubind: 0
nodebind: 0
membind: 0 1
policy: default
preferred node: current
physcpubind: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
cpubind: 0
nodebind: 0
membind: 0 1

As you can see, each task gets bound to 18 cores. This system has 36 cores per socket and 2 sockets in total. SLURM binds task 0 to CPUs 0-17 on socket 0 and sets CUDA_VISIBLE_DEVICES=0,1 for that task, because those two GPUs are attached to the PCI bridge of that socket. Similarly, it binds task 1 to CPUs 18-35 on socket 0 and also sets CUDA_VISIBLE_DEVICES=0,1 (those CPUs are still on socket 0, so they are attached to the same PCI bridge). It then binds task 2 to CPUs 36-53 on socket 1 and sets CUDA_VISIBLE_DEVICES=2,3, and task 3 to CPUs 54-71 on socket 1, also with CUDA_VISIBLE_DEVICES=2,3. Please note that all of this is completely intended and correct SLURM behaviour. Ideally, the SLURMEnvironment should be able to deal with it, yet it currently can't.
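One direction for a fix (just an illustration of the idea, not a tested patch against the Lightning internals) would be to compare the SLURM task count against what SLURM itself reports about the allocation, rather than against what a single task happens to see:

    import os

    # Untested sketch: derive the per-node GPU count from SLURM's own view of
    # the allocation instead of from CUDA_VISIBLE_DEVICES inside one task.
    num_slurm_tasks = int(os.environ.get("SLURM_NTASKS", 0))
    num_nodes = int(os.environ.get("SLURM_NNODES", 1))

    # SLURM_GPUS_PER_NODE mirrors --gpus-per-node; it may look like "4" or "a100:4".
    gpus_per_node = int(os.environ.get("SLURM_GPUS_PER_NODE", "0").rsplit(":", 1)[-1])

    # With --gpu-bind=closest this still holds (4 == 1 * 4), whereas the current
    # check compares against len(self._parallel_devices) == 2 and fails.
    slurm_is_managing_tasks = num_slurm_tasks == num_nodes * gpus_per_node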

I have tried hard-coding the return value of _is_slurm_managing_tasks to True, just to see if that would fix it. However, this simply produces more errors: the SLURMEnvironment is now being used, but it results in

Traceback (most recent call last):
  File "/gpfs/home4/casparl/2D-VQ-AE-2/boring_model.py", line 64, in <module>
    run()
  File "/gpfs/home4/casparl/2D-VQ-AE-2/boring_model.py", line 61, in run
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/trainer/trainer.py", line 736, in _call_and_handle_interrupt
    self._teardown()
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/trainer/trainer.py", line 1298, in _teardown
    self.strategy.teardown()
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/strategies/ddp.py", line 471, in teardown
    if self.root_device.type == "cuda":
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/strategies/ddp.py", line 117, in root_device
    return self.parallel_devices[self.local_rank]
IndexError: list index out of range

probably because self._devices_flag is set to [0,1] for each task, even though two of the tasks only have access to devices [2,3].

I'm not really sure how to 'fix' this; it would require changes both in the _is_slurm_managing_tasks function and in the way that SLURMEnvironment assigns devices to each of the processes...
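To make the mismatch behind the IndexError concrete: SLURMEnvironment takes the local rank from SLURM_LOCALID (0-3 here), while parallel_devices only has two entries per task, which is exactly the indexing that blows up in root_device. A rough illustration with hypothetical values matching the job above:

    import torch

    # Illustrative values (not taken from Lightning itself):
    parallel_devices = [torch.device("cuda", 0), torch.device("cuda", 1)]  # 2 visible GPUs per task
    local_rank = 3  # SLURM_LOCALID of the 4th task on the node

    # DDPStrategy.root_device effectively does parallel_devices[local_rank]:
    try:
        root_device = parallel_devices[local_rank]
    except IndexError as err:
        print(f"IndexError: {err}")  # "list index out of range", as in the traceback above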

cc @awaelchli @akihironitta

Labels

bug (Something isn't working) · environment: slurm · pl (Generic label for PyTorch Lightning package)
