
DDP fails on multinode if each node has a different number of GPUs #15874

@SerezD


Bug description

Hi there,

I'm launching a PyTorch Lightning script in a multi-node environment.
To do so, I followed the suggestions at this link:
https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_intermediate_1.html

I have configured a bash script that gets the assigned nodes and the number of GPUs on each node, as well as the master node, master address, etc.

In the bash script I launch the mpirun command on every node, like:

mpirun -np 1 -H $NODE $EXEC $SCRIPT &

where $NODE is the current node, $EXEC is the python3 command, and $SCRIPT is my python script.

In the Python script (which is launched once per node, so N times in total), I correctly assign the variables:

os.environ["MASTER_ADDR"] = args.master
os.environ["MASTER_PORT"] = args.port
os.environ["WORLD_SIZE"] = args.world
os.environ["NODE_RANK"] = args.rank

Then, in the trainer:

trainer = pl.Trainer(strategy=DDPStrategy(find_unused_parameters=False), accelerator='gpu', devices=gpus, num_nodes=world_size)
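
For completeness, here is a minimal, self-contained sketch of the per-node entry point (the argparse argument names and the way I obtain the local GPU count are illustrative, not my exact code):

import argparse
import os

import torch
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Each node runs this script exactly once (launched via mpirun from the bash script).
parser = argparse.ArgumentParser()
parser.add_argument("--master")  # hostname of the master node
parser.add_argument("--port")    # rendezvous port
parser.add_argument("--world")   # number of nodes assigned to the job
parser.add_argument("--rank")    # rank of this node (0 .. world - 1)
args = parser.parse_args()

os.environ["MASTER_ADDR"] = args.master
os.environ["MASTER_PORT"] = args.port
os.environ["WORLD_SIZE"] = args.world
os.environ["NODE_RANK"] = args.rank

gpus = torch.cuda.device_count()  # GPUs visible on this node; differs per node in the failing case
world_size = int(args.world)      # number of nodes

trainer = pl.Trainer(
    strategy=DDPStrategy(find_unused_parameters=False),
    accelerator="gpu",
    devices=gpus,
    num_nodes=world_size,
)
# trainer.fit(model)  # model omitted for brevity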

Everything works fine if the scheduler assigns the same number of GPUs to each node
(in the following example, 2 GPUs on each of 4 nodes, 8 GPUs in total).

# Note: some prints are from my bash script
assigned nodes: gnode10 gnode16 gnode31 gnode32
assigned gpus per node: 2 2 2 2
WORLD_SIZE 4
MASTER gnode10
PORT 11551
bash script here! Waiting for all jobs to finish...
#############################################################################
# Here the 4 scripts launched by mpirun start (in this case WORLD_SIZE = 4)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-e2d7ffc6-fe40-bbff-cd8b-77571d64b1bc,GPU-ec7f5880-fe70-668a-2fe9-6b51118ff2ab]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [GPU-e2d7ffc6-fe40-bbff-cd8b-77571d64b1bc,GPU-ec7f5880-fe70-668a-2fe9-6b51118ff2ab]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-80c32807-6dbf-8b68-4740-086cece4b91a,GPU-0148835f-d43f-b272-72d2-38a6dea989d2]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [GPU-80c32807-6dbf-8b68-4740-086cece4b91a,GPU-0148835f-d43f-b272-72d2-38a6dea989d2]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-7c411134-8399-b08e-2f07-d45fdbff38c5,GPU-ce3d447b-109b-8294-c7a3-d4b1b36ede2e]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [GPU-d7ff4b7a-730d-7ada-6fec-40ba709c649a,GPU-2774936a-d980-022c-69a5-ea4a6e081e6f]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-d7ff4b7a-730d-7ada-6fec-40ba709c649a,GPU-2774936a-d980-022c-69a5-ea4a6e081e6f]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [GPU-7c411134-8399-b08e-2f07-d45fdbff38c5,GPU-ce3d447b-109b-8294-c7a3-d4b1b36ede2e]

# Then Training starts correctly.

The problem arises when the scheduler assigns a different number of GPUs to each node, which of course happens quite often since I am not the only person using the cluster.
In the following example, 4 GPUs are assigned across 3 nodes (1, 2, 1).

# Note: some prints are from my bash script
assigned nodes: gnode41 gnode54 gnode60
assigned gpus per node: 1 2 1
WORLD_SIZE 3
MASTER gnode41
PORT 25212
bash script here! Waiting for all jobs to finish...
##############################################################################
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/6
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/6
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
----------------------------------------------------------------------------------------------------
# just hanging forever...

As you can see, the assigned GLOBAL_RANK and MEMBER values are wrong.
In particular, for MEMBER the node with 2 GPUs assumes a total of 6 processes (2 × WORLD_SIZE), while the nodes with a single GPU assume a total of 3 processes (1 × WORLD_SIZE).
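
If I read the logs correctly, each node seems to derive the total process count from its own local GPU count rather than from a global value. A tiny illustration of the arithmetic I see in the logs (this is just my interpretation, not Lightning's actual code):

num_nodes = 3
gpus_per_node = {"gnode41": 1, "gnode54": 2, "gnode60": 1}

for node, local_gpus in gpus_per_node.items():
    # the denominator each node appears to use in "MEMBER: x/<total>"
    assumed_total = num_nodes * local_gpus
    print(f"{node} assumes {assumed_total} total processes")

# gnode41 assumes 3 total processes
# gnode54 assumes 6 total processes
# gnode60 assumes 3 total processes
# With mismatched totals, the rendezvous can never complete and the job hangs.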

Additional note: my cluster uses the PBS job scheduler, so I do not know whether the same happens with e.g. SLURM.

Thank you

How to reproduce the bug

No response

Error messages and logs

There are no error messages, since the process hangs forever.

Environment


- PyTorch Lightning Version: 1.7.7
- PyTorch Version: 1.11.0
- Python version: 3.9
- OS: Linux
- CUDA/cuDNN version: 11.7
- GPU models and configuration: NVIDIA A100
- How you installed Lightning (`conda`, `pip`, source): pip

More info

No response

cc @Borda

Labels

docs (Documentation related), question (Further information is requested)
