16 changes: 2 additions & 14 deletions docs/source-pytorch/accelerators/gpu_faq.rst
@@ -38,30 +38,18 @@ In DDP, DDP_SPAWN, Deepspeed, DDP_SHARDED your effective batch size will be 7 *
.. note:: Huge batch sizes are actually really bad for convergence. Check out:
`Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour <https://arxiv.org/abs/1706.02677>`_

In DP, which does not support multi-node, the effective batch size will be just 7, regardless of how many devices are being used.
The reason is that the full batch gets split across all of the devices on the single machine.

.. code-block:: python

# effective batch size = 7, each GPU sees a batch size of 1 except the last GPU
Trainer(accelerator="gpu", devices=8, strategy="dp")

# effective batch size = 7, the first GPU sees a batch size of 4, the other a batch size of 3
Trainer(accelerator="gpu", devices=2, num_nodes=10, strategy="dp")


----


*********************************************************
How do I use multiple GPUs on Jupyter or Colab notebooks?
*********************************************************

To use multiple GPUs on notebooks, use the *DDP_SPAWN*, *DDP_NOTEBOOK*, or *DP* mode.
To use multiple GPUs on notebooks, use the *DDP_SPAWN* or *DDP_NOTEBOOK* mode.

.. code-block:: python

Trainer(accelerator="gpu", devices=4, strategy="ddp_notebook" | "ddp_spawn" | "dp")
Trainer(accelerator="gpu", devices=4, strategy="ddp_notebook" | "ddp_spawn")

If you want to use other strategies, please launch your training from the command line.

133 changes: 13 additions & 120 deletions docs/source-pytorch/accelerators/gpu_intermediate.rst
@@ -20,7 +20,6 @@ Lightning supports multiple ways of doing distributed training.

|

- Data Parallel (``strategy='dp'``) (multiple-gpus, 1 machine)
- DistributedDataParallel (multiple-gpus across many machines)
- Regular (``strategy='ddp'``)
- Spawn (``strategy='ddp_spawn'``)
@@ -33,28 +32,6 @@ For a deeper understanding of what Lightning is doing, feel free to read this
`guide <https://medium.com/@_willfalcon/9-tips-for-training-lightning-fast-neural-networks-in-pytorch-8e63a502f565>`_.


Data Parallel
^^^^^^^^^^^^^
:class:`~torch.nn.DataParallel` (DP) splits a batch across k GPUs.
That is, if you have a batch of 32 and use DP with 2 GPUs, each GPU will process 16 samples,
after which the root node will aggregate the results.

.. warning:: DP use is discouraged by PyTorch and Lightning. State is not maintained on the replicas created by the
:class:`~torch.nn.DataParallel` wrapper and you may see errors or misbehavior if you assign state to the module
in the ``forward()`` or ``*_step()`` methods. For the same reason we cannot fully support
:doc:`Manual Optimization <../model/manual_optimization>` with DP. Use DDP which is more stable and at least 3x faster.

.. warning:: DP only supports scattering and gathering primitive collections of tensors like lists, dicts, etc.
Therefore :meth:`~pytorch_lightning.core.hooks.ModelHooks.transfer_batch_to_device` and
:meth:`~pytorch_lightning.core.hooks.ModelHooks.on_after_batch_transfer`
do not apply in this mode and if you have overridden any of them, an exception will be raised.

.. testcode::
:skipif: torch.cuda.device_count() < 2

# train on 2 GPUs (using DP mode)
trainer = Trainer(accelerator="gpu", devices=2, strategy="dp")

Distributed Data Parallel
^^^^^^^^^^^^^^^^^^^^^^^^^
:class:`~torch.nn.parallel.DistributedDataParallel` (DDP) works as follows:
@@ -189,7 +166,6 @@ The Trainer enables it by default when such environments are detected.
# can also be used in non-interactive environments
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_fork")

Data Parallel (``strategy="dp"``) is the only other strategy supported in interactive environments, but it is slower, discouraged by PyTorch, and has other limitations.
Among the native distributed strategies, regular DDP (``strategy="ddp"``) is still recommended as the go-to strategy over Spawn and Fork/Notebook for its speed and stability, but it can only be used with scripts.
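
For example, here is a minimal, self-contained sketch of a script-launched DDP run (the module, dataset, and device count below are hypothetical placeholders):

.. code-block:: python

    # train.py -- minimal sketch of a script-launched DDP run
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl


    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return nn.functional.cross_entropy(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    if __name__ == "__main__":
        data = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
        trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp", max_epochs=1)
        trainer.fit(LitModel(), DataLoader(data, batch_size=32))

Run it as a regular script, e.g. ``python train.py``; Lightning launches one process per GPU.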


@@ -234,107 +210,24 @@ Comparison of DDP variants and tradeoffs
- Fast


DP caveats
^^^^^^^^^^
In DP each GPU within a machine sees a portion of a batch.
It does roughly the following:

.. code-block:: python

    # illustration only: DP splits the batch and runs a copy of the model on each GPU
    def distributed_forward(batch, model):
        batch = torch.Tensor(32, 8)
        gpu_0_batch = batch[:8]
        gpu_1_batch = batch[8:16]
        gpu_2_batch = batch[16:24]
        gpu_3_batch = batch[24:]

        y_0 = model_copy_gpu_0(gpu_0_batch)
        y_1 = model_copy_gpu_1(gpu_1_batch)
        y_2 = model_copy_gpu_2(gpu_2_batch)
        y_3 = model_copy_gpu_3(gpu_3_batch)

        return [y_0, y_1, y_2, y_3]

So, when Lightning calls any of `training_step`, `validation_step`, or `test_step`,
you will only be operating on one of those pieces.

.. testcode::

# the batch here is a portion of the FULL batch
def training_step(self, batch, batch_idx):
y_0 = batch

For most metrics, this doesn't really matter. However, if you want to add something to your computational graph using
all batch parts, you can use the `training_step_end` method.

.. testcode::

    def training_step_end(self, outputs):
        # only used when training with dp
        outputs = torch.cat(outputs, dim=1)
        softmax = torch.softmax(outputs, dim=1)
        out = softmax.mean()
        return out

In pseudocode, the full sequence is:

.. code-block:: python

# get data
batch = next(dataloader)

# copy model and data to each gpu
batch_splits = split_batch(batch, num_gpus)
models = copy_model_to_gpus(model)

# in parallel, operate on each batch chunk
all_results = []
for gpu_num in gpus:
batch_split = batch_splits[gpu_num]
gpu_model = models[gpu_num]
out = gpu_model(batch_split)
all_results.append(out)

# use the full batch for something like softmax
full_out = model.training_step_end(all_results)

If `training_step_end` is defined, it will be called regardless of TPU, DP, DDP, etc., which means
it will behave the same regardless of the backend.

The validation and test steps have the same option when using DP.

.. testcode::

def validation_step_end(self, step_output):
...


def test_step_end(self, step_output):
...


Distributed and 16-bit precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Below are the possible configurations we support.

+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
| 1 GPU | 1+ GPUs | DDP | DP | 16-bit | command |
+=======+=========+=====+=====+========+=======================================================================+
| Y | | | | | `Trainer(accelerator="gpu", devices=1)` |
+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
| Y | | | | Y | `Trainer(accelerator="gpu", devices=1, precision=16)` |
+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
| | Y | Y | | | `Trainer(accelerator="gpu", devices=k, strategy='ddp')` |
+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
| | Y | Y | | Y | `Trainer(accelerator="gpu", devices=k, strategy='ddp', precision=16)` |
+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
| | Y | | Y | | `Trainer(accelerator="gpu", devices=k, strategy='dp')` |
+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
| | Y | | Y | Y | `Trainer(accelerator="gpu", devices=k, strategy='dp', precision=16)` |
+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+

DDP and DP can also be used with 1 GPU, but there's no reason to do so other than debugging distributed-related issues.
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| 1 GPU | 1+ GPUs | DDP | 16-bit | command |
+=======+=========+=====+========+=======================================================================+
| Y | | | | `Trainer(accelerator="gpu", devices=1)` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| Y | | | Y | `Trainer(accelerator="gpu", devices=1, precision=16)` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| | Y | Y | | `Trainer(accelerator="gpu", devices=k, strategy='ddp')` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| | Y | Y | Y | `Trainer(accelerator="gpu", devices=k, strategy='ddp', precision=16)` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+

DDP can also be used with 1 GPU, but there's no reason to do so other than debugging distributed-related issues.
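
The rows of the table map directly onto ``Trainer`` arguments. For example, a minimal sketch of the multi-GPU, 16-bit row (``k`` here is a placeholder for your GPU count):

.. code-block:: python

    from pytorch_lightning import Trainer

    k = 2  # placeholder: number of available GPUs
    trainer = Trainer(accelerator="gpu", devices=k, strategy="ddp", precision=16)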


Implement Your Own Distributed (DDP) training
1 change: 0 additions & 1 deletion docs/source-pytorch/api_references.rst
@@ -216,7 +216,6 @@ strategies
ColossalAIStrategy
DDPSpawnStrategy
DDPStrategy
DataParallelStrategy
DeepSpeedStrategy
FSDPStrategy
HPUParallelStrategy
94 changes: 0 additions & 94 deletions docs/source-pytorch/common/lightning_module.rst
@@ -261,52 +261,6 @@ override the :meth:`~pytorch_lightning.LightningModule.on_training_epoch_end` me
...
self.training_step_outputs.clear() # free memory

Training with DataParallel
==========================

When training using a ``strategy`` that splits the data from each batch across GPUs, you sometimes
need to aggregate the per-GPU outputs on the main GPU for processing (DP).

In this case, implement the :meth:`~pytorch_lightning.core.module.LightningModule.training_step_end`
method, which receives the outputs from all devices so that you can accumulate them into the effective result.

.. code-block:: python

def training_step(self, batch, batch_idx):
x, y = batch
y_hat = self.model(x)
loss = F.cross_entropy(y_hat, y)
pred = ...
return {"loss": loss, "pred": pred}


def training_step_end(self, batch_parts):
# predictions from each GPU
predictions = batch_parts["pred"]
# losses from each GPU
losses = batch_parts["loss"]

gpu_0_prediction = predictions[0]
gpu_1_prediction = predictions[1]

# do something with both outputs
return (losses[0] + losses[1]) / 2


Here is the Lightning training pseudo-code for DP:

.. code-block:: python

    for batch_idx, train_batch in enumerate(train_dataloader):
        batches = split_batch(train_batch)
        dp_outs = []
        for sub_batch in batches:
            # 1. run training_step on each per-GPU sub-batch
            dp_out = training_step(sub_batch, batch_idx)
            dp_outs.append(dp_out)

        # 2. aggregate the per-GPU outputs on the main device
        training_step_end(dp_outs)

------------------

@@ -399,54 +353,6 @@ Note that this method is called before :meth:`~pytorch_lightning.LightningModule
...
self.validation_step_outputs.clear() # free memory


Validating with DataParallel
============================

When validating using a ``strategy`` that splits the data from each batch across GPUs, you sometimes
need to aggregate the per-GPU outputs on the main GPU for processing (DP).

In this case, implement the :meth:`~pytorch_lightning.core.module.LightningModule.validation_step_end`
method, which receives the outputs from all devices so that you can accumulate them into the effective result.

.. code-block:: python

def validation_step(self, batch, batch_idx):
x, y = batch
y_hat = self.model(x)
loss = F.cross_entropy(y_hat, y)
pred = ...
return {"loss": loss, "pred": pred}


def validation_step_end(self, batch_parts):
# predictions from each GPU
predictions = batch_parts["pred"]
# losses from each GPU
losses = batch_parts["loss"]

gpu_0_prediction = predictions[0]
gpu_1_prediction = predictions[1]

# do something with both outputs
return (losses[0] + losses[1]) / 2


Here is the Lightning validation pseudo-code for DP:

.. code-block:: python

    for batch_idx, batch in enumerate(dataloader):
        batches = split_batch(batch)
        dp_outs = []
        for sub_batch in batches:
            # 1. run validation_step on each per-GPU sub-batch
            dp_out = validation_step(sub_batch, batch_idx)
            dp_outs.append(dp_out)

        # 2. aggregate the per-GPU outputs on the main device
        validation_step_end(dp_outs)

----------------

*******
3 changes: 0 additions & 3 deletions docs/source-pytorch/extensions/strategy.rst
@@ -81,9 +81,6 @@ The below table lists all relevant strategies available in Lightning with their
* - ddp
- :class:`~pytorch_lightning.strategies.DDPStrategy`
- Strategy for multi-process single-device training on one or multiple nodes. :ref:`Learn more. <accelerators/gpu_intermediate:Distributed Data Parallel>`
* - dp
- :class:`~pytorch_lightning.strategies.DataParallelStrategy`
- Implements data-parallel training in a single process, i.e., the model gets replicated to each device and each gets a split of the data. :ref:`Learn more. <accelerators/gpu_intermediate:Data Parallel>`
* - deepspeed
- :class:`~pytorch_lightning.strategies.DeepSpeedStrategy`
- Provides capabilities to run training using the DeepSpeed library, with training optimizations for large billion parameter models. :ref:`Learn more. <advanced/model_parallel:deepspeed>`
15 changes: 1 addition & 14 deletions docs/source-pytorch/guides/speed.rst
@@ -49,22 +49,9 @@ GPU Training Speedup Tips
When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory efficiency, and model scaling.
Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/model_parallel>`.

Prefer DDP Over DP
^^^^^^^^^^^^^^^^^^
:class:`~pytorch_lightning.strategies.dp.DataParallelStrategy` performs three GPU transfers for EVERY batch:

1. Copy the model to the device.
2. Copy the data to the device.
3. Copy the outputs of each device back to the main device.

.. image:: https://pl-public-data.s3.amazonaws.com/docs/static/images/distributed_training/dp.gif
:alt: Animation showing DP execution.
:width: 500
:align: center

|

Whereas :class:`~pytorch_lightning.strategies.ddp.DDPStrategy` only performs two transfer operations, making DDP much faster than DP:
:class:`~pytorch_lightning.strategies.ddp.DDPStrategy` performs only two transfer operations for each step, making it the simplest distributed training strategy (see the sketch after this list):

1. Moving data to the device.
2. Transferring and syncing gradients.
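
A minimal sketch of enabling DDP (the device count is a placeholder):

.. code-block:: python

    from pytorch_lightning import Trainer

    # per training step, DDP (1) moves the batch to the local device and (2) syncs gradients
    trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp")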
3 changes: 3 additions & 0 deletions src/lightning/pytorch/CHANGELOG.md
@@ -288,6 +288,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Removed support for passing a scheduling dictionary to `Trainer(accumulate_grad_batches=...)` ([#16729](https://github.com/Lightning-AI/lightning/pull/16729))


- Removed support for `DataParallel` (`strategy='dp'`) and the `LightningParallelModule` wrapper ([#16748](https://github.com/Lightning-AI/lightning/pull/16748))


- Removed the unused `lightning.pytorch.utilities.supporters.{SharedCycleIteratorState,CombinedLoaderIterator}` classes ([#16714](https://github.com/Lightning-AI/lightning/pull/16714))

