🐛 Bug
Currently testing the BoringModel on an LSF GPU cluster with the following training configuration:
import os
from pytorch_lightning import Trainer

trainer = Trainer(
    gpus=4,
    default_root_dir=os.getcwd(),
    max_epochs=4,
    accelerator="ddp",
    limit_train_batches=100,
    limit_val_batches=100,
    weights_summary=None,
)
As I understand it, the batch_size here is 1 by default, and since we are using 4 GPUs the expected number of iterations is 16:
train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
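As a rough sanity check of that count (just a sketch of my own arithmetic, assuming the DistributedSampler that DDP adds splits the samples evenly across the 4 processes):
import math

# Sketch: expected per-GPU iterations for the BoringModel run above.
# Assumption: DDP's DistributedSampler splits the dataset evenly across processes.
num_samples = 64   # RandomDataset(32, 64) -> 64 samples of dim 32
batch_size = 1     # DataLoader default
world_size = 4     # gpus=4
steps_per_gpu = math.ceil(num_samples / (batch_size * world_size))
print(steps_per_gpu)  # 16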
Here is the output of the BoringModel. As you can see, validation starts 50% into training (which is actually expected given the number of iterations). The same pattern holds for the rest of the epochs.
Epoch 0: 25%|██▌ | 8/32 [00:00<00:00, 49.28it/s, loss=0.87, v_num=15]
Epoch 0: 28%|██▊ | 9/32 [00:00<00:00, 54.10it/s, loss=0.85, v_num=15]
Epoch 0: 31%|███▏ | 10/32 [00:00<00:00, 58.54it/s, loss=0.804, v_num=15]
Epoch 0: 34%|███▍ | 11/32 [00:00<00:00, 63.03it/s, loss=0.742, v_num=15]
Epoch 0: 38%|███▊ | 12/32 [00:00<00:00, 67.24it/s, loss=0.681, v_num=15]
Epoch 0: 41%|████ | 13/32 [00:00<00:00, 71.28it/s, loss=0.636, v_num=15]
Epoch 0: 44%|████▍ | 14/32 [00:00<00:00, 75.12it/s, loss=0.607, v_num=15]
Epoch 0: 47%|████▋ | 15/32 [00:00<00:00, 78.83it/s, loss=0.569, v_num=15]
Epoch 0: 50%|█████ | 16/32 [00:00<00:00, 80.90it/s, loss=0.539, v_num=15]
Validating: 0it [00:00, ?it/s][A
Validating: 0%| | 0/16 [00:00<?, ?it/s][A
Epoch 0: 100%|██████████| 32/32 [00:00<00:00, 149.91it/s, loss=0.539, v_num=15]
[A
Epoch 0: 0%| | 0/32 [00:00<?, ?it/s, loss=0.539, v_num=15]
Epoch 1: 0%| | 0/32 [00:00<?, ?it/s, loss=0.539, v_num=15]
We also observed this behaviour with our own model and dataset. Similar to the above, the expected number of iterations for our dataset with 4 GPUs and batch size 16 is 2621, yet here the progress bar total is 4224. We observe that validation starts at 62%, which is around the expected number of iterations. On the other hand, the training progress bar seems to pick up from 62% and keeps advancing alongside the validation bar.
The interesting thing is that if we use DDP as the accelerator, the experiment gets stuck at 62% of the last epoch and validation never starts. If we use DDP_SPAWN instead, the experiment terminates successfully.
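For reference, the only thing we change between the two runs is the accelerator argument; all other Trainer arguments stay exactly as in the configuration above (imports as in the first snippet):
# Hangs at 62% of the last epoch; validation never starts:
trainer = Trainer(gpus=4, accelerator="ddp", max_epochs=4, default_root_dir=os.getcwd())
# Terminates successfully:
trainer = Trainer(gpus=4, accelerator="ddp_spawn", max_epochs=4, default_root_dir=os.getcwd())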
Epoch 0: 62%|██████▏ | 2618/4224 [26:39<16:21, 1.64it/s, loss=4.42, v_num=30]
Epoch 0: 62%|██████▏ | 2618/4224 [26:39<16:21, 1.64it/s, loss=4.42, v_num=30]
Epoch 0: 62%|██████▏ | 2619/4224 [26:40<16:20, 1.64it/s, loss=4.42, v_num=30]
Epoch 0: 62%|██████▏ | 2619/4224 [26:40<16:20, 1.64it/s, loss=4.42, v_num=30]
Validating: 0it [00:00, ?it/s][A
Validating: 0%| | 0/1605 [00:00<?, ?it/s][A
Validating: 0%| | 1/1605 [00:00<15:57, 1.68it/s][A
Epoch 0: 62%|██████▏ | 2621/4224 [26:40<16:18, 1.64it/s, loss=4.42, v_num=30]
Validating: 0%| | 2/1605 [00:01<15:20, 1.74it/s][A
Validating: 0%| | 3/1605 [00:01<15:48, 1.69it/s][A
Epoch 0: 62%|██████▏ | 2623/4224 [26:41<16:17, 1.64it/s, loss=4.42, v_num=30]
Validating: 0%| | 4/1605 [00:02<14:16, 1.87it/s][A
Validating: 0%| | 5/1605 [00:02<13:55, 1.92it/s][A
Epoch 0: 62%|██████▏ | 2625/4224 [26:42<16:16, 1.64it/s, loss=4.42, v_num=30]
Validating: 0%| | 6/1605 [00:03<13:56, 1.91it/s][A
Validating: 0%| | 7/1605 [00:03<12:40, 2.10it/s][A
Epoch 0: 62%|██████▏ | 2627/4224 [26:43<16:14, 1.64it/s, loss=4.42, v_num=30]
Validating: 0%| | 8/1605 [00:04<14:00, 1.90it/s][A
Validating: 1%| | 9/1605 [00:04<11:49, 2.25it/s][A
Epoch 0: 62%|██████▏ | 2629/4224 [26:44<16:13, 1.64it/s, loss=4.42, v_num=30]
Validating: 1%| | 10/1605 [00:04<10:37, 2.50it/s][A
Validating: 1%| | 11/1605 [00:05<10:45, 2.47it/s][A
Epoch 0: 62%|██████▏ | 2631/4224 [26:45<16:11, 1.64it/s, loss=4.42, v_num=30]
…
Validating: 1%| | 20/1605 [00:07<06:07, 4.31it/s][A
Validating: 1%|▏ | 21/1605 [00:07<05:56, 4.44it/s][A
Epoch 0: 63%|██████▎ | 2641/4224 [26:47<16:03, 1.64it/s, loss=4.42, v_num=30]
Validating: 1%|▏ | 22/1605 [00:07<05:41, 4.64it/s][A
Validating: 1%|▏ | 23/1605 [00:08<05:21, 4.92it/s][A
Epoch 0: 63%|██████▎ | 2643/4224 [26:48<16:01, 1.64it/s, loss=4.42, v_num=30]
Validating: 1%|▏ | 24/1605 [00:08<05:07, 5.14it/s][A
Validating: 2%|▏ | 25/1605 [00:08<05:01, 5.24it/s][A
Epoch 0: 63%|██████▎ | 2645/4224 [26:48<16:00, 1.64it/s, loss=4.42, v_num=30]
Validating: 2%|▏ | 26/1605 [00:08<04:58, 5.30it/s][A
Validating: 2%|▏ | 27/1605 [00:08<05:04, 5.18it/s][A
Epoch 0: 63%|██████▎ | 2647/4224 [26:48<15:58, 1.65it/s, loss=4.42, v_num=30]
Validating: 2%|▏ | 28/1605 [00:09<04:58, 5.28it/s][A
Validating: 2%|▏ | 29/1605 [00:09<04:51, 5.41it/s][A
Epoch 0: 63%|██████▎ | 2649/4224 [26:49<15:56, 1.65it/s, loss=4.42, v_num=30]
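For what it's worth, the totals in the bar above do add up if we assume the main epoch bar counts training plus validation batches per GPU (just our reading of the numbers, not something we have verified in Lightning's code):
# Sanity check of the progress-bar totals from the run above.
train_batches_per_gpu = 2619  # where the main bar sits when validation kicks in
val_batches_per_gpu = 1605    # total shown on the Validating bar
total = train_batches_per_gpu + val_batches_per_gpu
print(total)                                        # 4224 -> matches the main bar total
print(round(100 * train_batches_per_gpu / total))   # 62 -> the point where validation starts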
Should we assume that, apart from the progress bar issue, we can safely proceed with the experiments?
CUDA:
  - GPU: Quadro RTX 6000
  - available: True
  - version: 10.2
Packages:
  - numpy: 1.17.0
  - pyTorch_debug: True
  - pyTorch_version: 1.7.0
  - pytorch-lightning: 1.2.1
  - tqdm: 4.58.0
System:
  - OS: Linux
  - architecture: 64bit
  - processor: x86_64
  - python: 3.7.10