Multi-node training freezes during ddp initialization #8707

@catalys1

Description

🐛 Bug

I'm trying to do multi-node training using SLURM, with 2 nodes of 4 GPUs each. The job starts up, but it freezes during DDP setup: 4 of the 8 processes initialize, and then it hangs waiting for the rest. Here's the output I get:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8

And this is the output of squeue

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
42859712       m9g      job.sh   catalys1  R         0:25   2 m9g-1-[3-4]
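
In case it helps with debugging, here's a small diagnostic (just a sketch, not part of the repro) that prints the standard SLURM variables each task sees. Launched with srun it should report 8 tasks across the two nodes; in the hang above only global ranks 0-3 show up, which would correspond to a single node's 4 GPUs.

import os
import socket

# standard SLURM environment variables describing the task layout
keys = ["SLURM_JOB_ID", "SLURM_NTASKS", "SLURM_NNODES", "SLURM_NODEID",
        "SLURM_PROCID", "SLURM_LOCALID", "SLURM_JOB_NODELIST"]
print(socket.gethostname(), {k: os.environ.get(k, "<unset>") for k in keys})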

To Reproduce

Here's a minimal test case. I'm only able to test this on the university system.

Training code (test.py):

import logging
import os
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
import torchvision.transforms as transforms

import pytorch_lightning as ptl

def get_logger(name=__name__, level=logging.INFO):
    """Initializes python logger."""

    logger = logging.getLogger(name)
    logger.setLevel(level)

    # this ensures all logging levels get marked with the rank zero decorator
    # otherwise logs would get multiplied for each GPU process in multi-GPU setup
    for level in ("debug", "info", "warning", "error", "exception", "fatal", "critical"):
        setattr(logger, level, ptl.utilities.rank_zero_only(getattr(logger, level)))

    return logger


log = get_logger(__name__)

class CoolModel(ptl.LightningModule):

    def __init__(self):
        super(CoolModel, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def my_loss(self, y_hat, y):
        return F.cross_entropy(y_hat, y)

    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'loss': self.my_loss(y_hat, y)}

    def validation_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': self.my_loss(y_hat, y)}

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {'avg_val_loss': avg_loss}

    def configure_optimizers(self):
        return [torch.optim.Adam(self.parameters(), lr=0.02)]

    def train_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def val_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def test_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)


from pytorch_lightning import Trainer

model = CoolModel()

# train for 1 epoch with DDP across 2 nodes with 4 GPUs each (for demo purposes)
trainer = Trainer(max_epochs=1, gpus=[0,1,2,3], num_nodes=2, accelerator='ddp')

trainer.fit(model)
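
As far as I can tell, Lightning's SLURM integration derives the world size from the scheduler rather than from the Trainer arguments, so with gpus=[0,1,2,3] and num_nodes=2 it should see 8 tasks. A hypothetical pre-flight check along those lines (SLURM_NTASKS is a standard SLURM variable; the expected size below is just hard-coded from the Trainer call):

import os

# hypothetical pre-flight check: the number of SLURM tasks should match
# gpus per node * num_nodes passed to the Trainer
expected_world_size = 4 * 2
slurm_ntasks = int(os.environ.get("SLURM_NTASKS", "0"))
if slurm_ntasks and slurm_ntasks != expected_world_size:
    print(f"warning: SLURM_NTASKS={slurm_ntasks}, "
          f"but the Trainer expects {expected_world_size} processes")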

And here's my SLURM job file (job.sh).

#!/bin/bash

#SBATCH --time=00:10:00   # walltime
#SBATCH --ntasks-per-node=12   # number of processor cores (i.e. tasks)
#SBATCH --gpus=8
#SBATCH --mem-per-cpu=1536M   # memory per CPU core
#SBATCH --nodes=2
#SBATCH -C pascal
#SBATCH --qos=test

# Set the max number of threads to use for programs using OpenMP. Should be <= ppn. Does nothing if the program doesn't use OpenMP.
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
module load python/3.8
module load cuda/10.1

source ~/venv/bin/activate

cd ~/test/

python test.py
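
A bare torch.distributed check, independent of Lightning, can also confirm whether the two nodes can actually rendezvous over NCCL. This is only a sketch: it assumes the script is launched with srun with one task per GPU (so SLURM_PROCID, SLURM_NTASKS and SLURM_LOCALID are set per task) and that MASTER_ADDR/MASTER_PORT are exported in the batch script.

# nccl_check.py -- minimal multi-node rendezvous test (sketch, not the repro)
# launch with: srun python nccl_check.py
# assumes the batch script exports, e.g.:
#   export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
#   export MASTER_PORT=12910
import os
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

torch.cuda.set_device(local_rank)
dist.init_process_group("nccl", rank=rank, world_size=world_size)

# a single all-reduce of ones; if every rank prints the world size, rendezvous works
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {rank}/{world_size} on {os.uname().nodename}: all_reduce -> {t.item()}")
dist.destroy_process_group()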

Expected behavior

Training runs successfully across both nodes, with all 8 DDP processes initializing.

Environment

PyTorch Lightning Version: 1.4.0
PyTorch Version: 1.9.0
Python Version: 3.8
OS: RedHat Linux
CUDA Version: 10.1
GPU models: 4xP100 per node
Installed PyTorch via pip

Additional context
