scale_batch_size does not work anymore? #13696

@cschell

Description

🐛 Bug

After an update of PyTorch Lightning, the batch size scaling in trainer.tune no longer works: it exhausts the maximum number of configured trials without actually probing the training loop, and therefore reports batch sizes that are far too large and cause CUDA OOM errors during the subsequent fit.

After some code diving, I think this is due to a bug in the method _run_power_scaling, which sets trainer.fit_loop.global_step = 0 instead of trainer.fit_loop.epoch_loop.global_step = 0; the latter seems to be required after a recent refactor of FitLoop.
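
If that reading is correct, the fix should amount to resetting the counter on the epoch loop instead. A minimal sketch of the suspected change (the helper name _reset_fit_progress is made up for illustration; in 1.6.4 the reset happens inline in _run_power_scaling):

# sketch of the suspected fix in pytorch_lightning/tuner/batch_size_scaling.py
def _reset_fit_progress(trainer):  # hypothetical helper, for illustration only
    # what 1.6.4 does -- after the FitLoop refactor this assignment no longer
    # resets the counter the loop actually reads, so every trial finishes
    # immediately without running a training step:
    #     trainer.fit_loop.global_step = 0

    # what appears to be required now that global_step lives on the epoch loop:
    trainer.fit_loop.epoch_loop.global_step = 0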

Edit:

The script below works as expected with pytorch-lightning==1.5.9.

To Reproduce

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer, LightningDataModule


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringDatamodule(LightningDataModule):
    def __init__(self, batch_size):
        super().__init__()
        self.batch_size = batch_size

    def train_dataloader(self):
        return DataLoader(RandomDataset(1_000, 999_999), batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(RandomDataset(1_000, 999_999), batch_size=self.batch_size)


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1_000, 10_000)
        self.batch_size = "unset"

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def run():
    model = BoringModel().to("cuda:0")
    trainer = Trainer(
            default_root_dir=os.getcwd(),
            limit_train_batches=1,
            limit_val_batches=1,
            limit_test_batches=1,
            num_sanity_val_steps=0,
            max_epochs=1,
            enable_model_summary=False,
            auto_scale_batch_size=True,  # enable the batch size finder in trainer.tune
            gpus=1
    )

    # the model can be fitted with batch_size=1000 on a NVIDIA 2080Ti without OOM errors
    datamodule = BoringDatamodule(batch_size=1000)

    # on 1.6.4 this exhausts all 10 trials without running a single training
    # step, so the batch size it settles on was never actually probed
    trainer.tune(model=model, datamodule=datamodule, scale_batch_size_kwargs={"init_val": 1_000, "max_trials": 10})

    print(f"fitting with {datamodule.batch_size=} (fails on a NVIDIA 2080Ti)")
    trainer.fit(model, datamodule=datamodule)


if __name__ == "__main__":
    run()

Expected behavior

trainer.tune finds a batch size that does not provoke a CUDA OOM error.
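
Concretely, on 1.5.9 the run above behaves like this (a sketch; I am assuming trainer.tune returns its results as a dict keyed by tuner name, which may differ across versions):

result = trainer.tune(
    model=model,
    datamodule=datamodule,
    scale_batch_size_kwargs={"init_val": 1_000, "max_trials": 10},
)
# each trial runs a training step, backs off on OOM, and the largest
# working batch size is written back to the datamodule in place
print(result["scale_batch_size"])
print(datamodule.batch_size)  # same value as above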

Environment

  • CUDA:
    - GPU: NVIDIA GeForce RTX 2080 Ti
    - available: True
    - version: 11.3
  • Packages:
    - numpy: 1.21.6
    - pyTorch_debug: False
    - pyTorch_version: 1.12.0+cu113
    - pytorch-lightning: 1.6.4
    - tqdm: 4.64.0
  • System:
    - OS: Linux
    - architecture: 64bit, ELF
    - processor: x86_64
    - python: 3.8.10
    - version: 127-Ubuntu SMP Wed May 18 14:30:56 UTC 2022

cc @akihironitta @Borda @rohitgr7

Labels: bug (Something isn't working), tuner