Bug description
Hi there,
I'm launching a PyTorch-Lightning script in a multinode environment.
In order to do so, I have followed the suggestions at this link:
https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_intermediate_1.html
I have configured a bash script that gets the number of assigned nodes and the number of GPUs on each node, as well as the master node, master address, etc.
In the bash script I launch the mpirun command on every node, like this:
mpirun -np 1 -H $NODE $EXEC $SCRIPT &
where $NODE is the current node, $EXEC is the python3 command, and $SCRIPT is my Python script.
In the Python script (which is launched once per node, i.e. N times), I correctly assign the variables:
os.environ["MASTER_ADDR"] = args.master
os.environ["MASTER_PORT"] = args.port
os.environ["WORLD_SIZE"] = args.world
os.environ["NODE_RANK"] = args.rank
Then, in the trainer:
trainer = pl.Trainer(strategy=DDPStrategy(find_unused_parameters=False), accelerator='gpu', devices=gpus, num_nodes=world_size)
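For completeness, here is a minimal, self-contained sketch of the relevant part of the script. The argparse wiring and the --gpus argument are assumptions on my side; gpus and world_size in the call above correspond to args.gpus and int(args.world) here:

import argparse
import os

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

parser = argparse.ArgumentParser()
parser.add_argument("--master", type=str)  # address of the master node (from the bash launcher)
parser.add_argument("--port", type=str)    # master port
parser.add_argument("--world", type=str)   # number of assigned nodes
parser.add_argument("--rank", type=str)    # rank of this node
parser.add_argument("--gpus", type=int)    # GPUs assigned to this node (hypothetical argument)
args = parser.parse_args()

# Rendezvous environment variables, as in the snippet above
os.environ["MASTER_ADDR"] = args.master
os.environ["MASTER_PORT"] = args.port
os.environ["WORLD_SIZE"] = args.world
os.environ["NODE_RANK"] = args.rank

trainer = pl.Trainer(
    strategy=DDPStrategy(find_unused_parameters=False),
    accelerator="gpu",
    devices=args.gpus,
    num_nodes=int(args.world),
)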
Everything works fine if the scheduler assigns the same number of GPUs to each node (2 GPUs on 4 nodes in the following example, 8 GPUs in total):
# Note: some prints are from my bash script
assigned nodes: gnode10 gnode16 gnode31 gnode32
assigned gpus per node: 2 2 2 2
WORLD_SIZE 4
MASTER gnode10
PORT 11551
bash script here! Waiting for all jobs to finish...
#############################################################################
# Here start the 4 scripts launched by mpirun (in this case WORLD_SIZE = 4)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-e2d7ffc6-fe40-bbff-cd8b-77571d64b1bc,GPU-ec7f5880-fe70-668a-2fe9-6b51118ff2ab]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [GPU-e2d7ffc6-fe40-bbff-cd8b-77571d64b1bc,GPU-ec7f5880-fe70-668a-2fe9-6b51118ff2ab]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-80c32807-6dbf-8b68-4740-086cece4b91a,GPU-0148835f-d43f-b272-72d2-38a6dea989d2]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [GPU-80c32807-6dbf-8b68-4740-086cece4b91a,GPU-0148835f-d43f-b272-72d2-38a6dea989d2]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-7c411134-8399-b08e-2f07-d45fdbff38c5,GPU-ce3d447b-109b-8294-c7a3-d4b1b36ede2e]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [GPU-d7ff4b7a-730d-7ada-6fec-40ba709c649a,GPU-2774936a-d980-022c-69a5-ea4a6e081e6f]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [GPU-d7ff4b7a-730d-7ada-6fec-40ba709c649a,GPU-2774936a-d980-022c-69a5-ea4a6e081e6f]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [GPU-7c411134-8399-b08e-2f07-d45fdbff38c5,GPU-ce3d447b-109b-8294-c7a3-d4b1b36ede2e]
# Then Training starts correctly.
The problem appears when the scheduler assigns a different number of GPUs to each node, which can of course happen very often since I am not the only person using the cluster.
In the following example, 4 GPUs are assigned across 3 nodes (1, 2, 1):
# Note: some prints are from my bash script
assigned nodes: gnode41 gnode54 gnode60
assigned gpus per node: 1 2 1
WORLD_SIZE 3
MASTER gnode41
PORT 25212
bash script here! Waiting for all jobs to finish...
##############################################################################
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/6
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/6
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
----------------------------------------------------------------------------------------------------
# just hanging forever...
As you can see, the GLOBAL_RANK and MEMBER values assigned are wrong. In particular, for MEMBER, the node with 2 GPUs assumes a total of 6 processes (2 times WORLD_SIZE), while the nodes with one GPU assigned assume a total of 3 processes (1 times WORLD_SIZE).
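To make the mismatch explicit, here is a small illustration (not Lightning's internal code, just the arithmetic described above) of how each node ends up expecting a different process count when the per-node GPU count differs:

num_nodes = 3  # the value exported as WORLD_SIZE and passed as num_nodes above
gpus_per_node = {"gnode41": 1, "gnode54": 2, "gnode60": 1}  # from the failing run

# Each node computes its expected total as num_nodes * its local GPU count,
# so the processes disagree on how many peers should join the rendezvous:
for node, local_gpus in gpus_per_node.items():
    print(f"{node}: expects {num_nodes * local_gpus} total processes")

# Output:
# gnode41: expects 3 total processes
# gnode54: expects 6 total processes
# gnode60: expects 3 total processes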
Additional note: my cluster uses the PBS job scheduler, so I do not know whether the same happens with e.g. SLURM.
Thank you
How to reproduce the bug
No response
Error messages and logs
There are no error messages, since the process hangs forever.
Environment
- PyTorch Lightning Version: 1.7.7
- PyTorch Version: 1.11.0
- Python version: 3.9
- OS: Linux
- CUDA/cuDNN version: 11.7
- GPU models and configuration: NVIDIA A100
- How you installed Lightning (conda, pip, source): pip
More info
No response
cc @Borda