
Commit 67c09e3

awaelchli, carmocca, and pre-commit-ci[bot] authored
Separate the Gradient Accumulation Scheduler from Trainer (#16729)
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent c7ffd41 commit 67c09e3

20 files changed: +149 additions, -235 deletions

docs/source-pytorch/common/gradient_accumulation.rst

Lines changed: 4 additions & 12 deletions
@@ -19,25 +19,17 @@ effective batch size is increased but there is no memory overhead.
     # Accumulate gradients for 7 batches
     trainer = Trainer(accumulate_grad_batches=7)
 
-You can set different values for it at different epochs by passing a dictionary, where the key represents the epoch at which the value for gradient accumulation
-should be updated.
-
-.. testcode::
-
-    # till 5th epoch, it will accumulate every 8 batches. From 5th epoch
-    # till 9th epoch it will accumulate every 4 batches and after that no accumulation
-    # will happen. Note that you need to use zero-indexed epoch keys here
-    trainer = Trainer(accumulate_grad_batches={0: 8, 4: 4, 8: 1})
-
-Or, you can create custom :class:`~pytorch_lightning.callbacks.gradient_accumulation_scheduler.GradientAccumulationScheduler`
+Optionally, you can make the ``accumulate_grad_batches`` value change over time by using the :class:`~pytorch_lightning.callbacks.gradient_accumulation_scheduler.GradientAccumulationScheduler`.
+Pass in a scheduling dictionary, where the key represents the epoch at which the value for gradient accumulation should be updated.
 
 .. testcode::
 
     from pytorch_lightning.callbacks import GradientAccumulationScheduler
 
-
     # till 5th epoch, it will accumulate every 8 batches. From 5th epoch
     # till 9th epoch it will accumulate every 4 batches and after that no accumulation
    # will happen. Note that you need to use zero-indexed epoch keys here
     accumulator = GradientAccumulationScheduler(scheduling={0: 8, 4: 4, 8: 1})
     trainer = Trainer(callbacks=accumulator)
+
+Note: Not all strategies and accelerators support variable gradient accumulation windows.
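
The scheduling dictionary in the updated docs maps a zero-indexed starting epoch to an accumulation factor. As a rough illustration of that semantics (an assumption drawn from the comments above; `resolve_accumulation` is a hypothetical helper, not part of this commit):

    # Hypothetical helper, for illustration only: the factor for an epoch is the
    # value of the largest configured epoch key that does not exceed it.
    def resolve_accumulation(scheduling: dict, epoch: int) -> int:
        factor = 1
        for start_epoch in sorted(scheduling):
            if epoch >= start_epoch:
                factor = scheduling[start_epoch]
        return factor

    scheduling = {0: 8, 4: 4, 8: 1}
    # Epochs 0-3 accumulate 8 batches, epochs 4-7 accumulate 4, epoch 8 onward accumulates 1.
    assert [resolve_accumulation(scheduling, e) for e in range(10)] == [8, 8, 8, 8, 4, 4, 4, 4, 1, 1]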

docs/source-pytorch/common/optimization.rst

Lines changed: 2 additions & 0 deletions
@@ -57,6 +57,8 @@ always switch to :ref:`manual optimization <manual_optimization>`.
 Manual optimization is required if you wish to work with multiple optimizers.
 
 
+.. _gradient_accumulation:
+
 Gradient Accumulation
 =====================
 
docs/source-pytorch/common/trainer.rst

Lines changed: 2 additions & 4 deletions
@@ -271,8 +271,7 @@ accumulate_grad_batches
 
 |
 
-Accumulates grads every k batches or as set up in the dict.
-Trainer also calls ``optimizer.step()`` for the last indivisible step number.
+Accumulates gradients over k batches before stepping the optimizer.
 
 .. testcode::
 
@@ -284,8 +283,7 @@ Example::
     # accumulate every 4 batches (effective batch size is batch*4)
     trainer = Trainer(accumulate_grad_batches=4)
 
-    # no accumulation for epochs 1-4. accumulate 3 for epochs 5-10. accumulate 20 after that
-    trainer = Trainer(accumulate_grad_batches={5: 3, 10: 20})
+See also: :ref:`gradient_accumulation` to enable more fine-grained accumulation schedules.
 
 
 benchmark
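
For context on the reworded ``accumulate_grad_batches`` description, here is a minimal plain-PyTorch sketch of the pattern the option automates (not Lightning's internal loop): scale the loss by 1/k and step the optimizer every k batches, giving an effective batch size of batch_size * k.

    import torch

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(12)]  # toy data
    k = 4  # accumulate over 4 batches -> effective batch size 8 * 4 = 32

    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        loss = torch.nn.functional.mse_loss(model(x), y)
        (loss / k).backward()  # scale so the accumulated gradient matches one large batch
        if (i + 1) % k == 0:
            optimizer.step()
            optimizer.zero_grad()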

src/lightning/pytorch/CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -241,8 +241,10 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 
 - Removed the `using_lbfgs` argument from `LightningModule.optimizer_step` hook ([#16538](https://github.com/Lightning-AI/lightning/pull/16538))
 
+
 - Removed the `Trainer.data_parallel` property. Use `isinstance(trainer.strategy, ParallelStrategy)` instead ([#16703](https://github.com/Lightning-AI/lightning/pull/16703))
 
+
 - Removed support for multiple optimizers in automatic optimization mode ([#16539](https://github.com/Lightning-AI/lightning/pull/16539))
     * Removed `opt_idx` argument from `BaseFinetuning.finetune_function` callback method
     * Removed `opt_idx` argument from `Callback.on_before_optimizer_step` callback method
@@ -265,10 +267,16 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 
 - Removed `PrecisionPlugin.dispatch` ([#16618](https://github.com/Lightning-AI/lightning/pull/16618))
 
+
 - Removed the unused `lightning.pytorch.utilities.metrics.metrics_to_scalars` function ([#16681](https://github.com/Lightning-AI/lightning/pull/16681))
 
+
+- Removed support for passing a scheduling dictionary to `Trainer(accumulate_grad_batches=...)` ([#16729](https://github.com/Lightning-AI/lightning/pull/16729))
+
+
 - Removed the unused `lightning.pytorch.utilities.supporters.{SharedCycleIteratorState,CombinedLoaderIterator}` classes ([#16714](https://github.com/Lightning-AI/lightning/pull/16714))
 
+
 ### Fixed
 
 -
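
The CHANGELOG entry for #16729 implies a small migration for users who passed a scheduling dictionary to the Trainer. A sketch of the before/after, based on the documentation and docstring changes in this commit (imports follow the `lightning.pytorch` namespace used in the source files below):

    from lightning.pytorch import Trainer
    from lightning.pytorch.callbacks import GradientAccumulationScheduler

    # Before (removed in this commit):
    # trainer = Trainer(accumulate_grad_batches={0: 8, 4: 4, 8: 1})

    # After: pass the schedule through the callback instead.
    accumulator = GradientAccumulationScheduler(scheduling={0: 8, 4: 4, 8: 1})
    trainer = Trainer(callbacks=[accumulator])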

src/lightning/pytorch/callbacks/gradient_accumulation_scheduler.py

Lines changed: 47 additions & 4 deletions
@@ -25,6 +25,8 @@
 import lightning.pytorch as pl
 from lightning.pytorch.callbacks.callback import Callback
 from lightning.pytorch.utilities.exceptions import MisconfigurationException
+from lightning.pytorch.utilities.model_helpers import is_overridden
+from lightning.pytorch.utilities.rank_zero import rank_zero_warn
 
 
 class GradientAccumulationScheduler(Callback):
@@ -58,9 +60,6 @@ class GradientAccumulationScheduler(Callback):
         # because epoch (key) should be zero-indexed.
         >>> accumulator = GradientAccumulationScheduler(scheduling={4: 2})
         >>> trainer = Trainer(callbacks=[accumulator])
-
-        # alternatively, pass the scheduling dict directly to the Trainer
-        >>> trainer = Trainer(accumulate_grad_batches={4: 2})
     """
 
     def __init__(self, scheduling: Dict[int, int]):
@@ -82,7 +81,7 @@ def __init__(self, scheduling: Dict[int, int]):
         minimal_epoch = min(scheduling.keys())
         if minimal_epoch < 0:
             raise IndexError(f"Epochs indexing from 1, epoch {minimal_epoch} cannot be interpreted correct")
-        if minimal_epoch != 0:  # if user didnt define first epoch accumulation factor
+        if minimal_epoch != 0:  # if user didn't define first epoch accumulation factor
             scheduling.update({0: 1})
 
         self.scheduling = scheduling
@@ -99,5 +98,49 @@ def get_accumulate_grad_batches(self, epoch: int) -> int:
                 break
         return accumulate_grad_batches
 
+    def on_train_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
+        """Performns a configuration validation before training starts and raises errors for incompatible
+        settings."""
+
+        if not pl_module.automatic_optimization:
+            raise RuntimeError(
+                """Automatic gradient accumulation and the `GradientAccumulationScheduler` is not supported for
+                manual optimization. Please remove the callback or switch to automatic optimization."""
+            )
+
+        overridden_optimizer_step = is_overridden("optimizer_step", pl_module)
+        overridden_optimizer_zero_grad = is_overridden("optimizer_zero_grad", pl_module)
+        going_to_accumulate_grad_batches = self.going_to_accumulate_grad_batches()
+        has_overridden_optimization_functions = overridden_optimizer_step or overridden_optimizer_zero_grad
+        if has_overridden_optimization_functions and going_to_accumulate_grad_batches:
+            rank_zero_warn(
+                "When using `Trainer(accumulate_grad_batches != 1)` and overriding"
+                " `LightningModule.optimizer_{step,zero_grad}`, the hooks will not be called on every batch"
+                " (rather, they are called on every optimization step)."
+            )
+
+        # local import to avoid circular import
+        from lightning.pytorch.accelerators import IPUAccelerator
+        from lightning.pytorch.strategies import ColossalAIStrategy, DeepSpeedStrategy
+
+        unsupported_strategies = (DeepSpeedStrategy, ColossalAIStrategy)
+        unsupported_accelerators = (IPUAccelerator,)
+
+        if isinstance(trainer.accelerator, unsupported_accelerators):
+            raise RuntimeError(
+                f"The `{type(trainer.accelerator).__name__}` does not support `accumulate_grad_batches` changing"
+                " between epochs."
+            )
+        if isinstance(trainer.strategy, unsupported_strategies):
+            raise RuntimeError(
+                f"The `{type(trainer.strategy).__name__}` does not support `accumulate_grad_batches` changing"
+                " between epochs."
+            )
+        if trainer.accumulate_grad_batches != 1:
+            raise ValueError(
+                "You have set `accumulate_grad_batches` and are using the `GradientAccumulationScheduler`"
+                " callback. Either remove `accumulate_grad_batches` from the Trainer or remove the callback."
+            )
+
     def on_train_epoch_start(self, trainer: "pl.Trainer", *_: Any) -> None:
         trainer.accumulate_grad_batches = self.get_accumulate_grad_batches(trainer.current_epoch)
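
Based on the validation added in `on_train_start`, combining a non-default `Trainer(accumulate_grad_batches=...)` with the callback should now fail when training starts. A hedged sketch of that expectation (not a test from this commit; `BoringModel` from `lightning.pytorch.demos.boring_classes` is an assumed demo model used only for illustration):

    from lightning.pytorch import Trainer
    from lightning.pytorch.callbacks import GradientAccumulationScheduler
    from lightning.pytorch.demos.boring_classes import BoringModel  # assumed demo model

    trainer = Trainer(
        accumulate_grad_batches=2,  # conflicts with the scheduler per the new check
        callbacks=[GradientAccumulationScheduler(scheduling={0: 4})],
        max_epochs=1,
    )
    try:
        trainer.fit(BoringModel())
    except ValueError as err:
        print(err)  # suggests removing either `accumulate_grad_batches` or the callback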

src/lightning/pytorch/callbacks/stochastic_weight_avg.py

Lines changed: 1 addition & 0 deletions
@@ -251,6 +251,7 @@ def on_train_epoch_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningMo
             # There is no need to perform either backward or optimizer.step as we are
             # performing only one pass over the train data-loader to compute activation statistics
             # Therefore, we will virtually increase `num_training_batches` by 1 and skip backward.
+            assert isinstance(trainer.num_training_batches, int)
             trainer.num_training_batches += 1
             trainer.fit_loop._skip_backward = True
             self._accumulate_grad_batches = trainer.accumulate_grad_batches

src/lightning/pytorch/loops/fit_loop.py

Lines changed: 0 additions & 3 deletions
@@ -238,9 +238,6 @@ def on_advance_start(self) -> None:
         assert isinstance(self.trainer.train_dataloader, CombinedLoader)
         _set_sampler_epoch(self.trainer.train_dataloader, self.epoch_progress.current.processed)
 
-        # changing gradient according accumulation_scheduler
-        self.trainer.accumulation_scheduler.on_train_epoch_start(self.trainer, self.trainer.lightning_module)
-
         self.epoch_progress.increment_ready()
 
         self.trainer._logger_connector.on_epoch_start()
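
With the explicit call removed from `fit_loop.py`, the per-epoch update now reaches the scheduler through the ordinary callback dispatch (its `on_train_epoch_start`, shown earlier, sets `trainer.accumulate_grad_batches`). A sketch of that mechanism with a hypothetical custom callback (illustration only, not part of this commit):

    import lightning.pytorch as pl
    from lightning.pytorch.callbacks import Callback

    class HalvingAccumulation(Callback):
        """Hypothetical callback: halve the accumulation factor each epoch, 8 -> 4 -> 2 -> 1."""

        def on_train_epoch_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
            # Same attribute the GradientAccumulationScheduler sets via the standard hook dispatch.
            trainer.accumulate_grad_batches = max(1, 8 // (2 ** trainer.current_epoch))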

src/lightning/pytorch/strategies/colossalai.py

Lines changed: 0 additions & 6 deletions
@@ -348,12 +348,6 @@ def setup(self, trainer: "pl.Trainer") -> None:
                 "ColossalAI does not support gradient accumulation now. Please set `accumulate_grad_batches` to 1."
             )
 
-        accumulation_scheduler = trainer.accumulation_scheduler
-        if accumulation_scheduler.epochs != [0]:
-            raise ValueError(
-                "ColossalAI currently does not support different `accumulate_grad_batches` at different epochs."
-            )
-
         if not isinstance(self.precision_plugin, ColossalAIPrecisionPlugin):
             raise ValueError("`ColossalAIStrategy` is only compatible with `ColossalAIPrecisionPlugin`.")
 
src/lightning/pytorch/strategies/deepspeed.py

Lines changed: 0 additions & 7 deletions
@@ -441,13 +441,6 @@ def init_deepspeed(self) -> None:
                 f"DeepSpeed strategy is only supported on GPU but `{self.accelerator.__class__.__name__}` is used."
             )
 
-        accumulation_scheduler = self.lightning_module.trainer.accumulation_scheduler
-
-        if accumulation_scheduler.epochs != [0]:
-            raise MisconfigurationException(
-                "DeepSpeed currently does not support different `accumulate_grad_batches` at different epochs."
-            )
-
         assert isinstance(self.model, (pl.LightningModule, _LightningPrecisionModuleWrapperBase))
         model = _LightningModuleWrapperBase(forward_module=self.model)
 

src/lightning/pytorch/strategies/ipu.py

Lines changed: 0 additions & 20 deletions
@@ -105,9 +105,6 @@ def __init__(
         self._optimizer_zero_grad_original: Optional[Callable] = None
 
     def setup(self, trainer: "pl.Trainer") -> None:
-        # set the `accumulate_grad_batches` property as early as possible
-        self._handle_gradient_accumulation_steps()
-
         # patch the dataloader creation function with the custom `poptorch.DataLoader`.
         # this violates the intended control flow for the plugins, but since this is experimental, we have chosen
         # to use the simpler solution before adding abstractions to override the `DataLoader` class
@@ -217,23 +214,6 @@ def _convert_to_poptorch_loader(
         )
         return dataloader
 
-    def _handle_gradient_accumulation_steps(self) -> None:
-        """Override the trainer.accumulation_scheduler to act as ``accumulate_grad_batches=1`` if gradient
-        accumulation has been set.
-
-        ``optimizer_step`` will be called on every batch, and the IPU will handle grad accumulation internally.
-        """
-        assert self.lightning_module is not None
-        accumulation_scheduler = self.lightning_module.trainer.accumulation_scheduler
-
-        if accumulation_scheduler.epochs != [0]:
-            raise MisconfigurationException(
-                "IPUs currently does not support different `accumulate_grad_batches` at different epochs."
-            )
-
-        # TODO(@tchaton): Add support for accumulate_grad_batches being a dictionary
-        accumulation_scheduler.scheduling.update({0: 1})
-
     @property
     def _n_replicate(self) -> int:
         assert self.lightning_module is not None
