
DDP fails on multinode if each node has a different number of GPUs #15874

@SerezD


Bug description

Hi there,

I'm launching a PyTorch Lightning script in a multi-node environment.
To do so, I followed the suggestions at this link:
https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_intermediate_1.html

I have configured a bash script that gets the assigned nodes and the number of GPUs on each node, as well as the master node, master address, etc.

In the bash script I launch the mpirun command on every node, like:

mpirun -np 1 -H $NODE $EXEC $SCRIPT &

where $NODE is the current node, $EXEC is the python3 command, and $SCRIPT is my python script.

In the Python script (which is launched once per node, so N times in total), I correctly assign the variables:

os.environ["MASTER_ADDR"] = args.master
os.environ["MASTER_PORT"] = args.port
os.environ["WORLD_SIZE"] = args.world
os.environ["NODE_RANK"] = args.rank

Then, in the trainer:

trainer = pl.Trainer(strategy=DDPStrategy(find_unused_parameters=False), accelerator='gpu', devices=gpus, num_nodes=world_size)
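
For completeness, here is a minimal, self-contained sketch of the per-node entry point (the argparse argument names and the way I obtain the local GPU count are illustrative, not my exact code):

import argparse
import os

import torch
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Each node runs this script exactly once (launched via mpirun from the bash script).
parser = argparse.ArgumentParser()
parser.add_argument("--master")  # hostname of the master node
parser.add_argument("--port")    # rendezvous port
parser.add_argument("--world")   # number of nodes assigned to the job
parser.add_argument("--rank")    # rank of this node (0 .. world - 1)
args = parser.parse_args()

os.environ["MASTER_ADDR"] = args.master
os.environ["MASTER_PORT"] = args.port
os.environ["WORLD_SIZE"] = args.world
os.environ["NODE_RANK"] = args.rank

gpus = torch.cuda.device_count()  # GPUs visible on this node; differs per node in the failing case
world_size = int(args.world)      # number of nodes

trainer = pl.Trainer(
    strategy=DDPStrategy(find_unused_parameters=False),
    accelerator="gpu",
    devices=gpus,
    num_nodes=world_size,
)
# trainer.fit(model)  # model omitted for brevity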

Everything works fine if the scheduler assigns the same number of GPUs to each node
(in the following example, 2 GPUs on each of 4 nodes, 8 GPUs in total).

# Note: some prints are from my bash script
assigned nodes: gnode10 gnode16 gnode31 gnode32
assigned gpus per node: 2 2 2 2
WORLD_SIZE 4
MASTER gnode10
PORT 11551
bash script here! Waiting for all jobs to finish...
#############################################################################
# Here the 4 scripts launched by mpirun start (in this case WORLD_SIZE = 4)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-e2d7ffc6-fe40-bbff-cd8b-77571d64b1bc,GPU-ec7f5880-fe70-668a-2fe9-6b51118ff2ab]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [GPU-e2d7ffc6-fe40-bbff-cd8b-77571d64b1bc,GPU-ec7f5880-fe70-668a-2fe9-6b51118ff2ab]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-80c32807-6dbf-8b68-4740-086cece4b91a,GPU-0148835f-d43f-b272-72d2-38a6dea989d2]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [GPU-80c32807-6dbf-8b68-4740-086cece4b91a,GPU-0148835f-d43f-b272-72d2-38a6dea989d2]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-7c411134-8399-b08e-2f07-d45fdbff38c5,GPU-ce3d447b-109b-8294-c7a3-d4b1b36ede2e]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [GPU-d7ff4b7a-730d-7ada-6fec-40ba709c649a,GPU-2774936a-d980-022c-69a5-ea4a6e081e6f]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-d7ff4b7a-730d-7ada-6fec-40ba709c649a,GPU-2774936a-d980-022c-69a5-ea4a6e081e6f]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [GPU-7c411134-8399-b08e-2f07-d45fdbff38c5,GPU-ce3d447b-109b-8294-c7a3-d4b1b36ede2e]

# Then Training starts correctly.

The problem arises when the scheduler assigns a different number of GPUs to each node, which of course happens quite often since I am not the only person using the cluster.
In the following example, 4 GPUs are assigned across 3 nodes (1, 2, 1).

# Note: some prints are from my bash script
assigned nodes: gnode41 gnode54 gnode60
assigned gpus per node: 1 2 1
WORLD_SIZE 3
MASTER gnode41
PORT 25212
bash script here! Waiting for all jobs to finish...
##############################################################################
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/6
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/6
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
----------------------------------------------------------------------------------------------------
# just hanging forever...

As you can see, the assigned GLOBAL_RANK and MEMBER values are wrong.
In particular, for MEMBER the node with 2 GPUs assumes a total of 6 processes (2 × WORLD_SIZE), while the nodes with a single GPU assume a total of 3 processes (1 × WORLD_SIZE).
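
If I read the logs correctly, each node seems to derive the total process count from its own local GPU count rather than from a global value. A tiny illustration of the arithmetic I see in the logs (this is just my interpretation, not Lightning's actual code):

num_nodes = 3
gpus_per_node = {"gnode41": 1, "gnode54": 2, "gnode60": 1}

for node, local_gpus in gpus_per_node.items():
    # the denominator each node appears to use in "MEMBER: x/<total>"
    assumed_total = num_nodes * local_gpus
    print(f"{node} assumes {assumed_total} total processes")

# gnode41 assumes 3 total processes
# gnode54 assumes 6 total processes
# gnode60 assumes 3 total processes
# With mismatched totals, the rendezvous can never complete and the job hangs.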

Additional note: my cluster uses the PBS job scheduler, so I do not know whether the same happens with e.g. SLURM.

Thank you

How to reproduce the bug

No response

Error messages and logs

There are no error messages, since the process hangs forever.

Environment


- PyTorch Lightning Version: 1.7.7
- PyTorch Version: 1.11.0
- Python version: 3.9
- OS: Linux
- CUDA/cuDNN version: 11.7
- GPU models and configuration: NVIDIA A100
- How you installed Lightning (`conda`, `pip`, source): pip

More info

No response

cc @Borda

Labels

docs (Documentation related), question (Further information is requested)
