[RFC] Tuner Revamp #11012

@rohitgr7

Proposed refactor

Issues

  1. The Tuner has caused many issues in the past, and we plan to refactor it. Most of them stem from snapshotting and restoring the trainer state.
  2. Auto batch size scaling doesn't work with validate/test/predict. Users might want to find an optimal batch_size for inference to make better use of their available compute resources.
  3. The LR Finder suggestion is not optimal. Sometimes its algorithm suggests a bad LR, and sometimes it doesn't suggest anything at all.
  4. It doesn't work with Flash finetuning. For example, a user might want to compute a new LR or a new batch_size after a certain number of pre-training epochs, which is not easily configurable within a single call. It can be done with multiple calls, but since we support strategies within Flash, this might be worth adding.

PS: Please add more issues to the list above if you have any regarding the tuner.
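
To make issue 3 concrete, here is a minimal sketch of a steepest-slope suggestion heuristic in plain Python. It is not the tuner's actual implementation; the function name `suggest_lr` and the `skip_begin`/`skip_end` parameters are assumptions chosen for illustration. It shows both failure modes: a noisy curve can put the steepest drop at a poor LR, and a short or flat curve yields no suggestion at all.

```python
from typing import Optional, Sequence


def suggest_lr(
    lrs: Sequence[float],
    losses: Sequence[float],
    skip_begin: int = 2,  # hypothetical: points to ignore at the start of the sweep
    skip_end: int = 1,    # hypothetical: points to ignore at the end of the sweep
) -> Optional[float]:
    """Return the LR at the steepest loss decrease, or None if there is none."""
    if len(lrs) != len(losses):
        raise ValueError("lrs and losses must have the same length")

    window = range(skip_begin, len(losses) - skip_end)
    if len(window) < 2:
        return None  # too few points left to estimate a slope

    best_idx, best_slope = None, 0.0
    for i in list(window)[:-1]:
        slope = losses[i + 1] - losses[i]  # forward difference
        if slope < best_slope:  # only strictly decreasing segments count
            best_idx, best_slope = i, slope

    return None if best_idx is None else lrs[best_idx]
```

A caller has to handle the `None` case explicitly, which is one reason a silent "no suggestion" is awkward when the result is applied automatically.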

Possible solutions

  • We can subclass Trainer for the tuner and give it independent state, so that it never snapshots or restores the trainer's state and stays fully independent of it.
from pytorch_lightning import Trainer

class Tuner(Trainer):
    # create independent states
    # create custom loops
    ...

# proposed usage (this API does not exist yet)
trainer.tuner(auto_scale_batch_size=..., auto_lr_find=...).fit()
trainer.tuner(auto_scale_batch_size=...).predict()

This solution could solve 1 & 2, but it probably can't be configured to solve 4.
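
As a rough illustration of how independent state could address 1 & 2 together, the sketch below doubles the batch size on a private copy of the state until a stage fails. Everything here is hypothetical: `scale_batch_size`, `run_stage` and the dict-based state are stand-ins, and `MemoryError` stands in for whatever out-of-memory error the real backend raises.

```python
import copy
from typing import Callable, Dict


def scale_batch_size(
    run_stage: Callable[[Dict, int], None],  # hypothetical: runs fit/validate/test/predict steps
    state: Dict,                             # hypothetical: stand-in for trainer state
    init_size: int = 2,
    max_trials: int = 10,
) -> int:
    """Double the batch size until run_stage fails; return the largest size that worked.

    Returns 0 if even init_size does not fit.
    """
    best = 0
    size = init_size
    for _ in range(max_trials):
        # Each trial gets its own copy, so the caller's state is never
        # mutated and nothing has to be restored afterwards.
        trial_state = copy.deepcopy(state)
        try:
            run_stage(trial_state, size)
        except MemoryError:  # stand-in for a real out-of-memory error
            break
        best = size
        size *= 2
    return best
```

Because `run_stage` is just a callable, the same search applies to any stage, including predict, without special-casing training.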

  • Another solution, proposed by @Borda, is to turn them into callbacks, so that users can configure each one independently, which can help resolve 4. But this solution might not resolve 1 & 2.

  • Another solution that @Borda and @SkafteNicki suggested, for now, is to move lr_finder to Bolts and experiment with it there, while improving scale_batch_size within Lightning. But it probably can't guarantee a solution to 4.
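
To show why the callback option maps well onto 4, here is a minimal sketch with a toy trainer in plain Python. `ToyTrainer`, `TunerCallback`, `find_lr` and `at_epochs` are all hypothetical names, not Lightning or Flash APIs; the only idea borrowed is a per-epoch hook that callbacks can implement.

```python
from typing import Callable, Iterable, List, Set


class ToyTrainer:
    """Hypothetical stand-in for a trainer: tracks epochs, an LR, and callbacks."""

    def __init__(self, max_epochs: int, lr: float, callbacks: Iterable["TunerCallback"] = ()):
        self.max_epochs = max_epochs
        self.lr = lr
        self.callbacks = list(callbacks)
        self.history: List[float] = []  # LR used in each epoch

    def fit(self) -> None:
        for epoch in range(self.max_epochs):
            for cb in self.callbacks:
                cb.on_train_epoch_start(self, epoch)
            self.history.append(self.lr)  # "train" one epoch with the current LR


class TunerCallback:
    """Hypothetical callback that re-tunes the LR at the start of selected epochs."""

    def __init__(self, find_lr: Callable[[ToyTrainer], float], at_epochs: Iterable[int]):
        self.find_lr = find_lr
        self.at_epochs: Set[int] = set(at_epochs)

    def on_train_epoch_start(self, trainer: ToyTrainer, epoch: int) -> None:
        if epoch in self.at_epochs:
            trainer.lr = self.find_lr(trainer)


# Re-tune before epoch 0 and again before epoch 2, all within a single fit() call.
trainer = ToyTrainer(
    max_epochs=4,
    lr=1.0,
    callbacks=[TunerCallback(find_lr=lambda t: t.lr / 2, at_epochs=[0, 2])],
)
trainer.fit()
```

The schedule lives in the callback's own arguments, so a mid-training re-tune needs no extra trainer calls. The sketch does not address 1 & 2: the callback still mutates the live trainer.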

Additional context

Other existing tuner issues that we need to address:
#9625
#10560
#10557

Thanks to @Borda @SkafteNicki @ethanwharris @akihironitta for helping out with the discussion and possible solutions.

cc @justusschock @awaelchli @akihironitta @Borda
