Multi-node training freezes during ddp initialization #8707

@catalys1

Description

🐛 Bug

I'm trying to do multi-node training using SLURM, with 2 nodes of 4 GPUs each. The job starts up, but it freezes during DDP setup: 4 of the 8 processes initialize, and then it hangs waiting for the rest. Here's the output I get:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8

And this is the output of squeue

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
42859712       m9g      job.sh   catalys1  R         0:25   2 m9g-1-[3-4]
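
In case it helps with debugging, here's a small diagnostic (just a sketch, not part of the repro) that prints the standard SLURM variables each task sees. Launched with srun it should report 8 tasks across the two nodes; in the hang above only global ranks 0-3 show up, which would correspond to a single node's 4 GPUs.

import os
import socket

# standard SLURM environment variables describing the task layout
keys = ["SLURM_JOB_ID", "SLURM_NTASKS", "SLURM_NNODES", "SLURM_NODEID",
        "SLURM_PROCID", "SLURM_LOCALID", "SLURM_JOB_NODELIST"]
print(socket.gethostname(), {k: os.environ.get(k, "<unset>") for k in keys})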

To Reproduce

Here's a minimal test case. I'm only able to test this on the university system.

Training code (test.py):

import logging
import os
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
import torchvision.transforms as transforms

import pytorch_lightning as ptl

def get_logger(name=__name__, level=logging.INFO):
    """Initializes python logger."""

    logger = logging.getLogger(name)
    logger.setLevel(level)

    # this ensures all logging levels get marked with the rank zero decorator
    # otherwise logs would get multiplied for each GPU process in multi-GPU setup
    for level in ("debug", "info", "warning", "error", "exception", "fatal", "critical"):
        setattr(logger, level, ptl.utilities.rank_zero_only(getattr(logger, level)))

    return logger


log = get_logger(__name__)

class CoolModel(ptl.LightningModule):

    def __init__(self):
        super(CoolModel, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def my_loss(self, y_hat, y):
        return F.cross_entropy(y_hat, y)

    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'loss': self.my_loss(y_hat, y)}

    def validation_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': self.my_loss(y_hat, y)}

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {'avg_val_loss': avg_loss}

    def configure_optimizers(self):
        return [torch.optim.Adam(self.parameters(), lr=0.02)]

    def train_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def val_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def test_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)


from pytorch_lightning import Trainer

model = CoolModel()

# train for 1 epoch with DDP across 2 nodes with 4 GPUs each (for demo purposes)
trainer = Trainer(max_epochs=1, gpus=[0,1,2,3], num_nodes=2, accelerator='ddp')

trainer.fit(model)
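
As far as I can tell, Lightning's SLURM integration derives the world size from the scheduler rather than from the Trainer arguments, so with gpus=[0,1,2,3] and num_nodes=2 it should see 8 tasks. A hypothetical pre-flight check along those lines (SLURM_NTASKS is a standard SLURM variable; the expected size below is just hard-coded from the Trainer call):

import os

# hypothetical pre-flight check: the number of SLURM tasks should match
# gpus per node * num_nodes passed to the Trainer
expected_world_size = 4 * 2
slurm_ntasks = int(os.environ.get("SLURM_NTASKS", "0"))
if slurm_ntasks and slurm_ntasks != expected_world_size:
    print(f"warning: SLURM_NTASKS={slurm_ntasks}, "
          f"but the Trainer expects {expected_world_size} processes")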

And here's my SLURM job file (job.sh).

#!/bin/bash

#SBATCH --time=00:10:00   # walltime
#SBATCH --ntasks-per-node=12   # number of processor cores (i.e. tasks)
#SBATCH --gpus=8
#SBATCH --mem-per-cpu=1536M   # memory per CPU core
#SBATCH --nodes=2
#SBATCH -C pascal
#SBATCH --qos=test

# Set the max number of threads to use for programs using OpenMP. Should be <= ppn. Does nothing if the program doesn't use OpenMP.
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
module load python/3.8
module load cuda/10.1

source ~/venv/bin/activate

cd ~/test/

python test.py
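
A bare torch.distributed check, independent of Lightning, can also confirm whether the two nodes can actually rendezvous over NCCL. This is only a sketch: it assumes the script is launched with srun with one task per GPU (so SLURM_PROCID, SLURM_NTASKS and SLURM_LOCALID are set per task) and that MASTER_ADDR/MASTER_PORT are exported in the batch script.

# nccl_check.py -- minimal multi-node rendezvous test (sketch, not the repro)
# launch with: srun python nccl_check.py
# assumes the batch script exports, e.g.:
#   export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
#   export MASTER_PORT=12910
import os
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

torch.cuda.set_device(local_rank)
dist.init_process_group("nccl", rank=rank, world_size=world_size)

# a single all-reduce of ones; if every rank prints the world size, rendezvous works
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {rank}/{world_size} on {os.uname().nodename}: all_reduce -> {t.item()}")
dist.destroy_process_group()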

Expected behavior

Training runs successfully across both nodes, with all 8 DDP processes initializing.

Environment

PyTorch Lightning Version: 1.4.0
PyTorch Version: 1.9.0
Python Version: 3.8
OS: RedHat Linux
CUDA Version: 10.1
GPU models: 4xP100 per node
Installed PyTorch via pip

Additional context
