Closed as not planned
Labels
question (Further information is requested), strategy: ddp (DistributedDataParallel)
Description
🐛 Bug
I have four nodes:
- g-1-0: 2*A100
- g-1-1: 4*A40
- g-1-2: 8*3090
- g-1-3: 8*3090
I get the following error:
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=8, worker_count=2, timeout=0:30:00)
My guess is that the wrong world size is being computed: since the nodes have different GPU counts, devices=-1 resolves differently on each node, so the ranks disagree about the world size (e.g., rank 0 on g-1-0 would compute 4 nodes * 2 GPUs = 8, which matches the world_size=8 in the error, while the 8-GPU nodes would compute 32).
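A minimal sketch of that suspicion, assuming the world size is derived as num_nodes times the locally detected GPU count (the node names and GPU counts are from the setup above):

```python
import torch

# Sketch of the suspected divergence, assuming devices=-1 resolves to the
# local GPU count on each node (an assumption, not confirmed behavior).
num_nodes = 4
local_gpus = torch.cuda.device_count()  # 2 on g-1-0, 4 on g-1-1, 8 on g-1-2/g-1-3
world_size = num_nodes * local_gpus     # -> 8, 16, 32, 32: the nodes disagree
print(f"this node joins the store-based barrier expecting world_size={world_size}")
```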
To Reproduce
trainer = pl.Trainer(devices=-1, num_nodes=4, strategy="fsdp", accelerator='gpu')
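For what it's worth, a possible workaround under that assumption is to pin an explicit, identical device count on every node (here 2, the smallest node's GPU count) so all ranks derive the same world size. This idles the extra GPUs on the larger nodes and is only a sketch, not a confirmed fix:

```python
import pytorch_lightning as pl

# Hypothetical workaround: the same explicit device count on every node,
# so each rank computes world_size = 4 nodes * 2 devices = 8.
trainer = pl.Trainer(devices=2, num_nodes=4, strategy="fsdp", accelerator="gpu")
```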
Expected behavior
Environment
- PyTorch Lightning Version (e.g., 1.5.0): 1.5.10
- PyTorch Version (e.g., 1.10): 1.10
- Python version (e.g., 3.9): 3.8
- OS (e.g., Linux): Linux
- CUDA/cuDNN version: 11.2
- GPU models and configuration:
- How you installed PyTorch (conda, pip, source): conda
- If compiling from source, the output of torch.__config__.show():
- Any other relevant information:
Additional context
cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7