🐛 Bug
I'm trying to do multi-node training using SLURM. The job starts up, but it freezes during DDP setup. I'm using 2 nodes with 4 GPUs each. It looks like 4 of the 8 processes get initialized, and then it hangs waiting for the rest. Here's the output I get:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
And this is the output of squeue:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
42859712 m9g job.sh catalys1 R 0:25 2 m9g-1-[3-4]
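Since only ranks 0-3 ever show up, here's a small check I can run under the same allocation to see what SLURM is actually exposing to each task. This is just a diagnostic sketch of mine (check_env.py is not part of the repro, and the variable list is based on my understanding of what Lightning's SLURM detection reads):
import os
import socket

# Hypothetical helper (check_env.py): print the SLURM variables that, as far
# as I understand, Lightning uses to work out rank and world size.
keys = ("SLURM_JOB_ID", "SLURM_NTASKS", "SLURM_NTASKS_PER_NODE",
        "SLURM_NODEID", "SLURM_PROCID", "SLURM_LOCALID")
values = {k: os.environ.get(k, "<unset>") for k in keys}
print(f"{socket.gethostname()}: {values}")
Launched with srun python check_env.py inside the job script, this should print one line per task, which would at least confirm whether SLURM is starting 8 tasks or only 4.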
To Reproduce
Here's a minimal test case. I'm only able to test this on the university system.
Training code (test.py):
import logging
import os

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
import torchvision.transforms as transforms

import pytorch_lightning as ptl


def get_logger(name=__name__, level=logging.INFO):
    """Initializes python logger."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # this ensures all logging levels get marked with the rank zero decorator
    # otherwise logs would get multiplied for each GPU process in multi-GPU setup
    for level in ("debug", "info", "warning", "error", "exception", "fatal", "critical"):
        setattr(logger, level, ptl.utilities.rank_zero_only(getattr(logger, level)))
    return logger


log = get_logger(__name__)


class CoolModel(ptl.LightningModule):
    def __init__(self):
        super(CoolModel, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def my_loss(self, y_hat, y):
        return F.cross_entropy(y_hat, y)

    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'loss': self.my_loss(y_hat, y)}

    def validation_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': self.my_loss(y_hat, y)}

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {'avg_val_loss': avg_loss}

    def configure_optimizers(self):
        return [torch.optim.Adam(self.parameters(), lr=0.02)]

    def train_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def val_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def test_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)
from pytorch_lightning import Trainer

model = CoolModel()

# train for 1 epoch across 2 nodes with 4 GPUs each, using DDP
trainer = Trainer(max_epochs=1, gpus=[0, 1, 2, 3], num_nodes=2, accelerator='ddp')
trainer.fit(model)
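For what it's worth, this is how I'm reading the expected world size from those Trainer arguments (my own arithmetic, not Lightning internals), which matches the MEMBER: x/8 lines in the log above:
# my reading of the Trainer arguments above
num_nodes = 2
gpus_per_node = 4                       # gpus=[0, 1, 2, 3]
expected_world_size = num_nodes * gpus_per_node
print(expected_world_size)              # 8 -> matches "MEMBER: x/8" in the log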
And here's my SLURM job file (job.sh):
#!/bin/bash
#SBATCH --time=00:10:00 # walltime
#SBATCH --ntasks-per-node=12 # number of processor cores (i.e. tasks)
#SBATCH --gpus=8
#SBATCH --mem-per-cpu=1536M # memory per CPU core
#SBATCH --nodes=2
#SBATCH -C pascal
#SBATCH --qos=test
# Set the max number of threads to use for programs using OpenMP. Should be <= ppn. Does nothing if the program doesn't use OpenMP.
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE
# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
module load python/3.8
module load cuda/10.1
source ~/venv/bin/activate
cd ~/test/
python test.py
Expected behavior
Training runs successfully across both nodes, with all 8 DDP processes initializing.
Environment
PyTorch Lightning Version: 1.4.0
PyTorch Version: 1.9.0
Python Version: 3.8
OS: RedHat Linux
CUDA Version: 10.1
GPU models: 4xP100 per node
Installed PyTorch via pip
Additional context
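Nothing yet, but to help narrow down whether the hang is in Lightning or in the cluster's rendezvous itself, here's a bare torch.distributed check I can try under the same allocation. This is only a sketch of mine (ddp_check.py is hypothetical, and it assumes MASTER_ADDR and MASTER_PORT are exported in the job script, e.g. MASTER_ADDR set to the first hostname from scontrol show hostnames "$SLURM_JOB_NODELIST"):
import datetime
import os

import torch
import torch.distributed as dist

# Hypothetical standalone check (ddp_check.py), independent of Lightning:
# every SLURM task joins one process group and does a single all_reduce.
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])

# MASTER_ADDR / MASTER_PORT are assumed to be exported by the job script.
dist.init_process_group(
    backend="gloo",                      # CPU-only, keeps NCCL out of the picture
    init_method="env://",
    rank=rank,
    world_size=world_size,
    timeout=datetime.timedelta(seconds=60),
)
t = torch.tensor([float(rank)])
dist.all_reduce(t)                       # sums the ranks across all processes
print(f"rank {rank}/{world_size} ok, sum of ranks = {t.item()}")
dist.destroy_process_group()
If srun --ntasks-per-node=4 python ddp_check.py completes on both nodes while the Lightning script still hangs, that would point more at how the Lightning/SLURM integration derives its world size than at the cluster networking.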