
Progress bar behaviour #6406

@hkmztrk

🐛 Bug

Currently testing the BoringModel on an LSF GPU cluster with the following training configuration:

trainer = Trainer(
    gpus=4,
    default_root_dir=os.getcwd(),
    max_epochs=4,
    accelerator="ddp",
    limit_train_batches=100,
    limit_val_batches=100,
    weights_summary=None,
)

As I understand it, the batch_size here is 1 by default, and since we are using 4 GPUs, the expected number of iterations per process is 16.

train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
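For reference, here is a minimal sketch of the arithmetic behind the bar total, assuming the main epoch bar counts both training and validation batches seen by a single process and that DDP's sampler shards the 64-sample dataset evenly across the 4 GPUs (an assumption about the bar's accounting, not a confirmed description of Lightning's internals):

# Back-of-the-envelope check of the progress-bar total for this config.
# Assumption: epoch bar total = training batches + validation batches per process,
# and the DistributedSampler splits the 64 samples evenly across 4 GPUs.
dataset_size = 64   # RandomDataset(32, 64)
batch_size = 1      # DataLoader default
num_gpus = 4

train_batches = dataset_size // batch_size // num_gpus   # 16 per process
val_batches = dataset_size // batch_size // num_gpus     # 16 (same-sized val loader)

bar_total = train_batches + val_batches   # 32, matches the log below
val_start = train_batches / bar_total     # 0.5, i.e. validation at the 50% mark
print(bar_total, val_start)

The limit_train_batches=100 and limit_val_batches=100 settings do not kick in here, since each process only sees 16 batches.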

Here is the output of the BoringModel. As you can see, validation starts 50% of the way into the progress bar (which is actually expected based on the number of iterations). The pattern is the same for the remaining epochs.

Epoch 0:  25%|██▌       | 8/32 [00:00<00:00, 49.28it/s, loss=0.87, v_num=15] 
Epoch 0:  28%|██▊       | 9/32 [00:00<00:00, 54.10it/s, loss=0.85, v_num=15]
Epoch 0:  31%|███▏      | 10/32 [00:00<00:00, 58.54it/s, loss=0.804, v_num=15]
Epoch 0:  34%|███▍      | 11/32 [00:00<00:00, 63.03it/s, loss=0.742, v_num=15]
Epoch 0:  38%|███▊      | 12/32 [00:00<00:00, 67.24it/s, loss=0.681, v_num=15]
Epoch 0:  41%|████      | 13/32 [00:00<00:00, 71.28it/s, loss=0.636, v_num=15]
Epoch 0:  44%|████▍     | 14/32 [00:00<00:00, 75.12it/s, loss=0.607, v_num=15]
Epoch 0:  47%|████▋     | 15/32 [00:00<00:00, 78.83it/s, loss=0.569, v_num=15]
Epoch 0:  50%|█████     | 16/32 [00:00<00:00, 80.90it/s, loss=0.539, v_num=15]

Validating: 0it [00:00, ?it/s][A

Validating:   0%|          | 0/16 [00:00<?, ?it/s][A
Epoch 0: 100%|██████████| 32/32 [00:00<00:00, 149.91it/s, loss=0.539, v_num=15]

Epoch 0:   0%|          | 0/32 [00:00<?, ?it/s, loss=0.539, v_num=15]
Epoch 1:   0%|          | 0/32 [00:00<?, ?it/s, loss=0.539, v_num=15]

We also observed this behaviour with our own model and dataset. Similar to the above, the expected number of iterations for our dataset and batch size (16) across 4 GPUs is 2621, yet the progress bar total is 4224. Validation starts at 62%, which is around the expected number of iterations. On the other hand, the training bar seems to pick up at 62% and keeps advancing while validation runs.

The interesting thing is that with DDP as the accelerator, the experiment gets stuck at 62% of the last epoch and validation never starts. With DDP_SPAWN, the experiment terminates successfully.

Epoch 0:  62%|██████▏   | 2618/4224 [26:39<16:21,  1.64it/s, loss=4.42, v_num=30]
Epoch 0:  62%|██████▏   | 2618/4224 [26:39<16:21,  1.64it/s, loss=4.42, v_num=30]
Epoch 0:  62%|██████▏   | 2619/4224 [26:40<16:20,  1.64it/s, loss=4.42, v_num=30]
Epoch 0:  62%|██████▏   | 2619/4224 [26:40<16:20,  1.64it/s, loss=4.42, v_num=30]
Validating: 0it [00:00, ?it/s][A
Validating:   0%|          | 0/1605 [00:00<?, ?it/s][A
Validating:   0%|          | 1/1605 [00:00<15:57,  1.68it/s][A
Epoch 0:  62%|██████▏   | 2621/4224 [26:40<16:18,  1.64it/s, loss=4.42, v_num=30]
Validating:   0%|          | 2/1605 [00:01<15:20,  1.74it/s][A
Validating:   0%|          | 3/1605 [00:01<15:48,  1.69it/s][A
Epoch 0:  62%|██████▏   | 2623/4224 [26:41<16:17,  1.64it/s, loss=4.42, v_num=30]
Validating:   0%|          | 4/1605 [00:02<14:16,  1.87it/s][A
Validating:   0%|          | 5/1605 [00:02<13:55,  1.92it/s][A
Epoch 0:  62%|██████▏   | 2625/4224 [26:42<16:16,  1.64it/s, loss=4.42, v_num=30]
Validating:   0%|          | 6/1605 [00:03<13:56,  1.91it/s][A
Validating:   0%|          | 7/1605 [00:03<12:40,  2.10it/s][A
Epoch 0:  62%|██████▏   | 2627/4224 [26:43<16:14,  1.64it/s, loss=4.42, v_num=30]
Validating:   0%|          | 8/1605 [00:04<14:00,  1.90it/s][A
Validating:   1%|          | 9/1605 [00:04<11:49,  2.25it/s][A
Epoch 0:  62%|██████▏   | 2629/4224 [26:44<16:13,  1.64it/s, loss=4.42, v_num=30]
Validating:   1%|          | 10/1605 [00:04<10:37,  2.50it/s][A
Validating:   1%|          | 11/1605 [00:05<10:45,  2.47it/s][A
Epoch 0:  62%|██████▏   | 2631/4224 [26:45<16:11,  1.64it/s, loss=4.42, v_num=30]
…
Validating:   1%|          | 20/1605 [00:07<06:07,  4.31it/s][A
Validating:   1%|▏         | 21/1605 [00:07<05:56,  4.44it/s][A
Epoch 0:  63%|██████▎   | 2641/4224 [26:47<16:03,  1.64it/s, loss=4.42, v_num=30]
Validating:   1%|▏         | 22/1605 [00:07<05:41,  4.64it/s][A
Validating:   1%|▏         | 23/1605 [00:08<05:21,  4.92it/s][A
Epoch 0:  63%|██████▎   | 2643/4224 [26:48<16:01,  1.64it/s, loss=4.42, v_num=30]
Validating:   1%|▏         | 24/1605 [00:08<05:07,  5.14it/s][A
Validating:   2%|▏         | 25/1605 [00:08<05:01,  5.24it/s][A
Epoch 0:  63%|██████▎   | 2645/4224 [26:48<16:00,  1.64it/s, loss=4.42, v_num=30]
Validating:   2%|▏         | 26/1605 [00:08<04:58,  5.30it/s][A
Validating:   2%|▏         | 27/1605 [00:08<05:04,  5.18it/s][A
Epoch 0:  63%|██████▎   | 2647/4224 [26:48<15:58,  1.65it/s, loss=4.42, v_num=30]
Validating:   2%|▏         | 28/1605 [00:09<04:58,  5.28it/s][A
Validating:   2%|▏         | 29/1605 [00:09<04:51,  5.41it/s][A
Epoch 0:  63%|██████▎   | 2649/4224 [26:49<15:56,  1.65it/s, loss=4.42, v_num=30]
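The same accounting seems to reproduce the totals in this log: adding the validation batches shown on the inner "Validating" bar to the training batches completed before validation begins gives the 4224 figure, and the 62% mark is where training hands over to validation. This is a hedged reading of the numbers above, under the same assumption about what the epoch bar counts:

# Sanity check of the numbers reported in the log above.
# Assumption (as before): epoch bar total = training + validation batches per process.
train_batches = 2619   # last training step before the "Validating" bar appears
val_batches = 1605     # total shown on the "Validating" bar

bar_total = train_batches + val_batches   # 4224, matches the epoch bar total
val_start = train_batches / bar_total     # ~0.62, i.e. the 62% mark
print(bar_total, round(val_start, 2))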

Should we assume that, aside from the progress bar issue, we can safely proceed with the experiments?

CUDA:
 GPU: Quadro RTX 6000
 available: True
 version: 10.2
Packages:
 numpy: 1.17.0
 pyTorch_debug: True
 pyTorch_version: 1.7.0
 pytorch-lightning: 1.2.1
 tqdm: 4.58.0
System:
 OS: Linux
 architecture: 64bit
 processor: x86_64
 python: 3.7.10

Metadata

Labels

bug (Something isn't working) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on) · priority: 0 (High priority task) · working as intended
