[RFC] Tuner Revamp

## Proposed refactor

### Issues
1. The Tuner has caused many issues in the past, and we plan to refactor it. Most of them stem from the trainer state [snapshotting and restoration](https://github.com/PyTorchLightning/pytorch-lightning/blob/3fcfd0214cc95fa7de49f6404421b5c6cd113c57/pytorch_lightning/tuner/batch_size_scaling.py#L102-L140).
2. Auto batch size scaling doesn't work with `validate`/`test`/`predict`. Users might want to find an optimal `batch_size` for inference to make better use of their available compute resources.
3. The [LR Finder suggestion](https://github.com/PyTorchLightning/pytorch-lightning/blob/3fcfd0214cc95fa7de49f6404421b5c6cd113c57/pytorch_lightning/tuner/lr_finder.py#L164-L182) is not reliable: sometimes its algorithm suggests a bad LR, and sometimes [it doesn't suggest anything at all](https://github.com/PyTorchLightning/pytorch-lightning/blob/3fcfd0214cc95fa7de49f6404421b5c6cd113c57/pytorch_lightning/tuner/lr_finder.py#L180-L182).
4. The Tuner doesn't work with [Flash finetuning](https://lightning-flash.readthedocs.io/en/latest/general/finetuning.html#finetuning-in-flash). For example, a user might want to compute a new LR or a new `batch_size` after a certain number of pre-training epochs, which can't easily be configured within a single call. It can be achieved with multiple calls, but since Flash supports finetuning strategies, this might be worth supporting directly.
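To make issue 3 concrete, here is a simplified steepest-gradient heuristic of the kind an LR finder can use: skip the first and last points of the LR sweep, drop non-finite losses, and suggest the LR where the loss falls fastest. This is our own sketch, not the linked implementation; `suggest_lr`, `numerical_gradient`, `skip_begin` and `skip_end` are hypothetical names. It reproduces both failure modes: a noisy curve can place the steepest slope at a poor LR, and a sweep with too few usable points yields no suggestion at all.

```python
import math
from typing import List, Optional


def numerical_gradient(values: List[float]) -> List[float]:
    """Central differences inside, one-sided differences at the ends (unit spacing)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 points to compute a gradient")
    grads = [values[1] - values[0]]
    for i in range(1, n - 1):
        grads.append((values[i + 1] - values[i - 1]) / 2)
    grads.append(values[-1] - values[-2])
    return grads


def suggest_lr(
    lrs: List[float], losses: List[float], skip_begin: int = 10, skip_end: int = 1
) -> Optional[float]:
    """Hypothetical heuristic: LR at the steepest loss decrease, or None if impossible."""
    # Drop the noisy first points and the (often diverging) last points.
    window = list(zip(lrs, losses))[skip_begin:len(lrs) - skip_end]
    # Ignore NaN/inf losses, keeping each loss paired with its own LR.
    window = [(lr, loss) for lr, loss in window if math.isfinite(loss)]
    try:
        grads = numerical_gradient([loss for _, loss in window])
    except ValueError:
        return None  # too few usable points: no suggestion at all
    steepest = min(range(len(grads)), key=grads.__getitem__)
    return window[steepest][0]
```

For example, `suggest_lr([1, 2, 3, 4, 5, 6], [5.0, 4.9, 4.0, 3.0, 2.95, 2.9], skip_begin=0, skip_end=0)` returns `3`, while the same six points with the default `skip_begin=10` return `None`.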

PS: please add any other issues you have with the Tuner here.

### Possible solutions
- We can subclass `Trainer` for the Tuner and give it independent states, so that the trainer state never needs to be snapshotted or restored.
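```py
from pytorch_lightning import Trainer


class Tuner(Trainer):
    # create independent states so the user's trainer state is never touched
    # create custom loops for the tuning runs
    ...


# Proposed API (not implemented yet); model/dataloader arguments omitted
trainer.tuner(auto_scale_batch_size=..., auto_lr_find=...).fit()
trainer.tuner(auto_scale_batch_size=...).predict()
```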
This solution could solve `1` and `2`, but it probably can't be configured to solve `4`.
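A toy sketch of the subclassing idea, assuming plain-Python stand-ins (`LoopState`, `MiniTrainer`, `MiniTuner` and `find_steps` are all hypothetical, not Lightning classes): because the tuner subclass builds its own state object, its trial runs can mutate that state freely, and the user's trainer state never has to be snapshotted or restored. The `find_steps` objective is only a placeholder for a real tuning trial.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class LoopState:
    """Hypothetical stand-in for the progress/metric state a run mutates."""
    global_step: int = 0
    current_epoch: int = 0
    metrics: Dict[str, float] = field(default_factory=dict)


class MiniTrainer:
    """Hypothetical minimal trainer: every run mutates self.state."""

    def __init__(self) -> None:
        self.state = LoopState()

    def run(self, steps: int) -> None:
        for _ in range(steps):
            self.state.global_step += 1
        # Placeholder objective: loss shrinks as more steps are taken.
        self.state.metrics["loss"] = 1.0 / max(self.state.global_step, 1)


class MiniTuner(MiniTrainer):
    """Tuner as a subclass: inherits the loops but owns an independent state."""

    def find_steps(self, candidates: Iterable[int]) -> Optional[int]:
        best, best_loss = None, float("inf")
        for steps in candidates:
            self.state = LoopState()  # fresh state per trial; the user's trainer is untouched
            self.run(steps)
            if self.state.metrics["loss"] < best_loss:
                best, best_loss = steps, self.state.metrics["loss"]
        return best
```

After `trainer.run(3)`, calling `MiniTuner().find_steps([1, 5, 2])` returns `5`, and `trainer.state.global_step` is still `3`. Note that this only isolates the trainer's own state; anything shared with the trainer, such as the model weights, would still need separate handling.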

- Another solution, proposed by @Borda, is to turn them into callbacks so that users can configure each one independently, which would help resolve `4`. However, this might not resolve `1` and `2`.
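The callback approach could look roughly like this for issue 4, assuming a hypothetical `on_train_epoch_start` hook loosely modelled on Lightning's callback hooks (`Callback`, `LRFinderCallback`, `ToyTrainer` and `find_lr` are illustrative stand-ins, not Lightning classes). The LR is re-tuned at chosen epochs, for example right after a finetuning strategy unfreezes the backbone, all within a single `fit` call.

```python
from typing import Callable, List


class Callback:
    """Hypothetical hook interface with a single per-epoch hook."""

    def on_train_epoch_start(self, trainer: "ToyTrainer") -> None:
        pass


class LRFinderCallback(Callback):
    """Hypothetical callback: re-tunes the LR at the start of selected epochs."""

    def __init__(self, at_epochs: List[int], find_lr: Callable[["ToyTrainer"], float]) -> None:
        self.at_epochs = set(at_epochs)
        self.find_lr = find_lr  # stand-in for a real LR search

    def on_train_epoch_start(self, trainer: "ToyTrainer") -> None:
        if trainer.current_epoch in self.at_epochs:
            trainer.lr = self.find_lr(trainer)


class ToyTrainer:
    """Hypothetical trainer that only calls hooks and records the LR per epoch."""

    def __init__(self, lr: float, callbacks: List[Callback]) -> None:
        self.lr = lr
        self.callbacks = callbacks
        self.current_epoch = 0
        self.lr_history: List[float] = []

    def fit(self, max_epochs: int) -> None:
        for epoch in range(max_epochs):
            self.current_epoch = epoch
            for callback in self.callbacks:
                callback.on_train_epoch_start(self)
            self.lr_history.append(self.lr)  # stand-in for one training epoch
```

With `ToyTrainer(lr=1.0, callbacks=[LRFinderCallback([2], lambda t: 0.25)])`, `fit(max_epochs=4)` records an LR history of `[1.0, 1.0, 0.25, 0.25]`.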

- A further, short-term option suggested by @Borda and @SkafteNicki is to move `lr_finder` to Bolts and experiment with it there, while improving `scale_batch_size` within Lightning. However, this doesn't guarantee a solution to `4`.
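One way to improve `scale_batch_size` is to keep the search independent of the loop it probes. Below is our sketch of a doubling-then-binary-search strategy; the function and parameter names are hypothetical and do not reflect Lightning's signature. `fits(batch_size)` is a hypothetical probe that would run one training step for `fit`, or only a forward pass for `validate`/`test`/`predict` (issue 2), and return `False` on an out-of-memory error.

```python
from typing import Callable


def scale_batch_size(fits: Callable[[int], bool], init_val: int = 2, max_trials: int = 25) -> int:
    """Largest batch size for which `fits` succeeds: doubling, then binary search."""
    if not fits(init_val):
        raise RuntimeError(f"batch size {init_val} already fails")
    low, high = init_val, None
    # Phase 1: double the batch size until the first failure.
    for _ in range(max_trials):
        candidate = low * 2
        if fits(candidate):
            low = candidate
        else:
            high = candidate
            break
    if high is None:
        return low  # trial budget exhausted without hitting a failure
    # Phase 2: binary search; `low` always fits and `high` never does.
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low
```

With a simulated memory limit, `scale_batch_size(lambda bs: bs <= 100)` doubles up to 64, fails at 128, and binary-searches down to `100`.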

### Additional context

Other current Tuner issues that we need to address:
- https://github.com/PyTorchLightning/pytorch-lightning/issues/9625
- https://github.com/PyTorchLightning/pytorch-lightning/issues/10560
- https://github.com/PyTorchLightning/pytorch-lightning/issues/10557

Thanks to @Borda, @SkafteNicki, @ethanwharris and @akihironitta for helping with the discussion and possible solutions.

cc @justusschock @awaelchli @akihironitta @borda

[RFC] Tuner Revamp #11012

Proposed refactor

Issues

Possible solutions

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[RFC] Tuner Revamp #11012

Description

Proposed refactor

Issues

Possible solutions

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions