🐛 Bug && steps to reproduce
I'm running the standard `boring_model` (or any model), with some minor changes to the arguments:
```python
trainer = Trainer(
    default_root_dir=os.getcwd(),
    limit_train_batches=1000,
    limit_val_batches=1,
    limit_test_batches=1,
    num_sanity_val_steps=0,
    max_epochs=100,
    enable_model_summary=False,
    accelerator='gpu',
    gpus=-1,
    strategy='ddp',
)
```
on our SLURM cluster like so:
```bash
srun -p gpu --cpus-per-task=18 --exclusive --mem-per-gpu=120000M --nodes=1 --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest python boring_model.py --accelerator 'gpu' --devices -1 --strategy ddp
```
However, it seems to spawn 8 processes: 4 'worlds' of size 2. While running, each GPU is running two processes. I also don't see "Multiprocessing is handled by SLURM" in the output, which I would expect based on here.
Expected behavior
I expect `srun` to launch 4 tasks, and PyTorch Lightning's `SLURMEnvironment` to turn these into a single world of size 4.
Environment
* CUDA:
- GPU:
- NVIDIA A100-SXM4-40GB
- NVIDIA A100-SXM4-40GB
- NVIDIA A100-SXM4-40GB
- NVIDIA A100-SXM4-40GB
- available: True
- version: 11.5
* Packages:
- numpy: 1.20.3
- pyTorch_debug: False
- pyTorch_version: 1.11.0+cu115
- pytorch-lightning: 1.6.3
- tqdm: 4.64.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.9.5
- version: #1 SMP Wed Apr 6 13:48:37 EDT 2022
Additional context
The issue first popped up in real-world code where I was combining PyTorch Lightning with Hydra, and showed up as a rather strange error message from Hydra. This thread contains a lot more analysis on that. However, I'll repeat/summarize the essentials here.
The problem arises because `_is_slurm_managing_tasks(self)` in `accelerator_connector.py` is returning `False`, despite the tasks being launched by SLURM. The result is that PyTorch Lightning will itself try to spawn tasks from each of the four processes already launched by `srun`. As we'll see later, each process sees 2 GPUs and thus launches two tasks, so I end up with my final total of 8 tasks.
Since PyTorch Lightning thinks it's running outside of a SLURM context, that also causes the Hydra error. This is because Lightning will use `subprocess_script.py` to try and launch a new process. Here it will add `hydra.run.dir` to the original argument list and call the original command again, but since the original command is the `_submit.py` script (and not the PyTorch Lightning training script), this fails with an unrecognized-argument error.
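To sketch that mechanism (heavily simplified from memory, not the actual `subprocess_script.py` code): for every extra local rank it thinks it needs, Lightning re-invokes the command it was originally started with, and appends Hydra overrides such as `hydra.run.dir` when Hydra is active:

```python
import os
import subprocess
import sys

# Heavily simplified sketch of the re-launch logic (not verbatim Lightning code).
num_devices = 2  # what auto_device_count() reported to this process (see below)
for local_rank in range(1, num_devices):
    env = {**os.environ, "LOCAL_RANK": str(local_rank)}
    command = [sys.executable] + sys.argv          # re-run whatever started rank 0
    command += [f"hydra.run.dir={os.getcwd()}"]    # appended when Hydra is detected
    subprocess.Popen(command, env=env)
# If sys.argv points at a submit wrapper instead of the training script, the
# appended override becomes an unrecognized argument and the child process errors out.
```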
That part aside, the real problem is: why doesn't Lightning recognize that SLURM is managing the tasks? Well, that's because of this line. Lightning assumes two things here:
1. That `self._parallel_devices` contains the number of physical parallel devices available per node.
2. That each node has the same number of parallel devices.
I won't go deeper into assumption (2), since it isn't broken here - I just want to remark that the `_is_slurm_managing_tasks()` logic would also break in that case. In our case, the problem is (1).
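For reference, here is roughly what that check amounts to in 1.6.x, paraphrased from memory rather than quoted verbatim; `parallel_devices` and `num_nodes` stand in for the connector's internal `self._parallel_devices` and `self._num_nodes_flag`:

```python
import os

# For this run, the connector ends up with 2 devices per process and 1 node.
parallel_devices = ["cuda:0", "cuda:1"]
num_nodes = 1

total_requested_devices = len(parallel_devices) * num_nodes  # 2
num_slurm_tasks = int(os.environ.get("SLURM_NTASKS", "0"))   # 4 under the srun above
print(num_slurm_tasks == total_requested_devices)            # False -> "not SLURM managed"
```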
Let's see what's going on step by step. `self._parallel_devices` is set here, based on a call to `self.accelerator.get_parallel_devices(self._devices_flag)`. In turn, `self._devices_flag` is set here in my case, based on a call to `self.accelerator.auto_device_count()`. Since my accelerator is a GPU, `self.accelerator` is a `GPUAccelerator` object, so it's calling this function. And this, surprisingly, is where the problem lies: `auto_device_count` will return the number of devices that that particular process has access to. That is potentially not the same as the number of physical parallel devices available per node in a SLURM allocation.
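As far as I can tell, `auto_device_count` for the GPU accelerator effectively boils down to `torch.cuda.device_count()`, which is a per-process count and respects whatever `CUDA_VISIBLE_DEVICES` the task was given:

```python
import torch

# Run inside one of the four srun tasks below, this prints 2, not 4:
# torch.cuda.device_count() only counts devices visible to *this* process,
# i.e. it respects the CUDA_VISIBLE_DEVICES set by --gpu-bind=closest.
print(torch.cuda.device_count())
```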
To demonstrate, consider the following run:
```bash
srun -p gpu --cpus-per-task=18 --exclusive --mem-per-gpu=120000M --nodes=1 --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest env | grep CUDA_VISIBLE_DEVICES
srun: job 1225150 queued and waiting for resources
srun: job 1225150 has been allocated resources
CUDA_VISIBLE_DEVICES=2,3
CUDA_VISIBLE_DEVICES=2,3
CUDA_VISIBLE_DEVICES=0,1
CUDA_VISIBLE_DEVICES=0,1
```
As you can see, SLURM launches 4 tasks. Each of those tasks gets access to a subset of the devices available on the node. In this case, `auto_device_count` will return 2 - the number of GPUs that that particular process has access to. The reason it only has access to two is the `--gpu-bind=closest` argument (see the documentation of `srun`). It makes sure that each task is bound to the GPUs that are closest (in a NUMA sense) to the CPUs controlling it. To demonstrate:
```bash
srun -p gpu --cpus-per-task=18 --exclusive --mem-per-gpu=120000M --nodes=1 --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest numactl --show
srun: job 1225362 queued and waiting for resources
srun: job 1225362 has been allocated resources
policy: default
preferred node: current
physcpubind: 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
cpubind: 1
nodebind: 1
membind: 0 1
policy: default
preferred node: current
physcpubind: 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
cpubind: 1
nodebind: 1
membind: 0 1
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
cpubind: 0
nodebind: 0
membind: 0 1
policy: default
preferred node: current
physcpubind: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
cpubind: 0
nodebind: 0
membind: 0 1
```
As you can see, each task gets bound to 18 cores. This system has 36 cores per socket and 2 sockets in total. SLURM binds task 0 to CPUs 0-17 on socket 0 and sets `CUDA_VISIBLE_DEVICES=0,1` for that task, because those two GPUs are attached to the PCI bridge of that socket. Similarly, it binds task 1 to CPUs 18-35 on socket 0 and also sets `CUDA_VISIBLE_DEVICES=0,1` (those CPUs are still on socket 0, so they are attached to the same PCI bridge). Then it binds task 2 to CPUs 36-53 on socket 1 and sets `CUDA_VISIBLE_DEVICES=2,3`, and it binds task 3 to CPUs 54-71 on socket 1 and sets `CUDA_VISIBLE_DEVICES=2,3`. Please note that all of this is completely intended and correct SLURM behaviour. Thus, ideally, the `SLURMEnvironment` should be able to deal with it - yet it doesn't.
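For what it's worth, SLURM already hands each task everything needed to form a single world of size 4. A quick check of the environment variables that `SLURMEnvironment` reads (the commented values are what I'd expect from the `srun` command above) illustrates this:

```python
import os

# What SLURM exports inside each of the four tasks from the srun above:
for var in ("SLURM_NTASKS", "SLURM_NNODES", "SLURM_PROCID", "SLURM_LOCALID"):
    print(var, os.environ.get(var))
# SLURM_NTASKS  -> "4"       world size
# SLURM_NNODES  -> "1"       number of nodes
# SLURM_PROCID  -> "0".."3"  global rank
# SLURM_LOCALID -> "0".."3"  local rank on the node
```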
I have tried simply hard-coding the output of `_is_slurm_managing_tasks` to `True`, just to see if that would fix it. However, this only produces more errors: the `SLURMEnvironment` is now being used, but it results in
```
Traceback (most recent call last):
  File "/gpfs/home4/casparl/2D-VQ-AE-2/boring_model.py", line 64, in <module>
    run()
  File "/gpfs/home4/casparl/2D-VQ-AE-2/boring_model.py", line 61, in run
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/trainer/trainer.py", line 736, in _call_and_handle_interrupt
    self._teardown()
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/trainer/trainer.py", line 1298, in _teardown
    self.strategy.teardown()
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/strategies/ddp.py", line 471, in teardown
    if self.root_device.type == "cuda":
  File "/gpfs/home4/casparl/2D-VQ-AE-2/__pypackages__/3.9/lib/pytorch_lightning/strategies/ddp.py", line 117, in root_device
    return self.parallel_devices[self.local_rank]
IndexError: list index out of range
```
probably because `self._devices_flag` is set to `[0,1]` for each task, even though two out of the four tasks only have access to devices `[2,3]`.
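A minimal illustration of why `self.parallel_devices[self.local_rank]` blows up under that hard-coded workaround (hypothetical values matching the run above, not Lightning code):

```python
# Hypothetical reconstruction of the state inside task 3 (not Lightning code):
parallel_devices = ["cuda:0", "cuda:1"]     # built from the 2 devices this task sees
local_rank = 3                              # SLURM_LOCALID of the 4th task on the node
root_device = parallel_devices[local_rank]  # IndexError: list index out of range
```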
I'm not really sure how to 'fix' this: it would require changes both in the `_is_slurm_managing_tasks` function and in the way that `SLURMEnvironment` assigns devices to each of the processes...
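As a very rough sketch of one direction (untested, and the names below are mine, not Lightning's): the detection could rely purely on SLURM's own task bookkeeping, independent of how many GPUs each individual task happens to see:

```python
import os

def slurm_is_managing_tasks(num_nodes: int) -> bool:
    """Untested sketch: decide based on SLURM's task counts only."""
    if "SLURM_NTASKS" not in os.environ:
        return False
    num_slurm_tasks = int(os.environ["SLURM_NTASKS"])
    # SLURM_NTASKS_PER_NODE may look like "4" or "4(x2)"; take the leading count.
    tasks_per_node = int(os.environ.get("SLURM_NTASKS_PER_NODE", "1").split("(")[0])
    return num_slurm_tasks == tasks_per_node * num_nodes
```

Even with something along these lines, the device-assignment side would still need to map each task's `SLURM_LOCALID` onto whatever `CUDA_VISIBLE_DEVICES` that task actually received, so the `SLURMEnvironment` part needs attention as well.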