
Distributed training across multiple nodes with different GPU counts #13506

@wangleiofficial

Description

🐛 Bug

I have four nodes:

g-1-0: 2*A100
g-1-1: 4*A40
g-1-2: 8*3090
g-1-3: 8*3090

I get this error:

RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=8, worker_count=2, timeout=0:30:00)

I guess it may be because the wrong world size is being computed.
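
To illustrate my guess (this is just my own arithmetic, not actual Lightning code): with devices=-1, each node seems to resolve the device count from its own local GPUs and multiply it by num_nodes, so the four nodes end up with different world sizes.

```python
# Sketch of my assumption about how each node derives the world size when
# devices=-1 (auto-detect local GPUs). Illustrative only, not Lightning code.
node_gpus = {"g-1-0": 2, "g-1-1": 4, "g-1-2": 8, "g-1-3": 8}
num_nodes = 4

for node, local_gpus in node_gpus.items():
    # devices=-1 appears to resolve to the local GPU count on each node
    world_size = num_nodes * local_gpus
    print(f"{node}: world_size={world_size}")

# g-1-0 computes 8 (matching world_size=8 in the error), while the other
# nodes compute 16 and 32, so only the 2 local workers on g-1-0 ever join
# the store-based barrier (worker_count=2) and rank 0 times out.
```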

To Reproduce

trainer = pl.Trainer(devices=-1, num_nodes=4, strategy="fsdp", accelerator='gpu')
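
As a possible workaround sketch (assuming every node must use the same number of devices), pinning devices to the smallest per-node GPU count should make all nodes agree on the world size, at the cost of leaving the extra GPUs idle:

```python
import pytorch_lightning as pl

# Workaround sketch (my assumption): use the same device count on every node
# so all nodes agree on world_size = 4 nodes * 2 GPUs = 8. The extra GPUs on
# g-1-1, g-1-2, and g-1-3 stay idle with this configuration.
trainer = pl.Trainer(devices=2, num_nodes=4, strategy="fsdp", accelerator="gpu")
```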

Expected behavior

Training initializes successfully across all four nodes and uses every available GPU.

Environment

  • PyTorch Lightning Version: 1.5.10
  • PyTorch Version: 1.10
  • Python version: 3.8
  • OS: Linux
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: 2*A100, 4*A40, 8*3090, 8*3090 (one node each)
  • How you installed PyTorch: conda

Additional context

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

Labels

question (Further information is requested), strategy: ddp (DistributedDataParallel)
