16 changes: 2 additions & 14 deletions docs/source-pytorch/accelerators/gpu_faq.rst
@@ -38,30 +38,18 @@ In DDP, DDP_SPAWN, Deepspeed, DDP_SHARDED your effective batch size will be 7 *
.. note:: Huge batch sizes are actually really bad for convergence. Check out:
`Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour <https://arxiv.org/abs/1706.02677>`_

In DP, which does not support multi-node, the effective batch size will be just 7, regardless of how many devices are being used.
The reason is that the full batch gets split across all of the devices on the single machine.

.. code-block:: python

# effective batch size = 7, each GPU sees a batch size of 1 except the last GPU
Trainer(accelerator="gpu", devices=8, strategy="dp")

# effective batch size = 7, the first GPU sees a batch size of 4, the other a batch size of 3
Trainer(accelerator="gpu", devices=2, num_nodes=10, strategy="dp")


----


*********************************************************
How do I use multiple GPUs on Jupyter or Colab notebooks?
*********************************************************

To use multiple GPUs on notebooks, use the *DDP_SPAWN*, *DDP_NOTEBOOK*, or *DP* mode.
To use multiple GPUs on notebooks, use the *DDP_SPAWN* or *DDP_NOTEBOOK* mode.

.. code-block:: python

Trainer(accelerator="gpu", devices=4, strategy="ddp_notebook" | "ddp_spawn" | "dp")
Trainer(accelerator="gpu", devices=4, strategy="ddp_notebook" | "ddp_spawn")

If you want to use other strategies, please launch your training from the command line.

133 changes: 13 additions & 120 deletions docs/source-pytorch/accelerators/gpu_intermediate.rst
@@ -20,7 +20,6 @@ Lightning supports multiple ways of doing distributed training.

|

- Data Parallel (``strategy='dp'``) (multiple-gpus, 1 machine)
- DistributedDataParallel (multiple-gpus across many machines)
- Regular (``strategy='ddp'``)
- Spawn (``strategy='ddp_spawn'``)
@@ -33,28 +32,6 @@ For a deeper understanding of what Lightning is doing, feel free to read this
`guide <https://medium.com/@_willfalcon/9-tips-for-training-lightning-fast-neural-networks-in-pytorch-8e63a502f565>`_.


Data Parallel
^^^^^^^^^^^^^
:class:`~torch.nn.DataParallel` (DP) splits a batch across k GPUs.
That is, if you have a batch of 32 and use DP with 2 GPUs, each GPU will process 16 samples,
after which the root node will aggregate the results.

.. warning:: DP use is discouraged by PyTorch and Lightning. State is not maintained on the replicas created by the
:class:`~torch.nn.DataParallel` wrapper and you may see errors or misbehavior if you assign state to the module
in the ``forward()`` or ``*_step()`` methods. For the same reason we cannot fully support
:doc:`Manual Optimization <../model/manual_optimization>` with DP. Use DDP which is more stable and at least 3x faster.

.. warning:: DP only supports scattering and gathering primitive collections of tensors like lists, dicts, etc.
Therefore :meth:`~pytorch_lightning.core.hooks.ModelHooks.transfer_batch_to_device` and
:meth:`~pytorch_lightning.core.hooks.ModelHooks.on_after_batch_transfer`
do not apply in this mode and if you have overridden any of them, an exception will be raised.

.. testcode::
:skipif: torch.cuda.device_count() < 2

# train on 2 GPUs (using DP mode)
trainer = Trainer(accelerator="gpu", devices=2, strategy="dp")

Distributed Data Parallel
^^^^^^^^^^^^^^^^^^^^^^^^^
:class:`~torch.nn.parallel.DistributedDataParallel` (DDP) works as follows:
@@ -189,7 +166,6 @@ The Trainer enables it by default when such environments are detected.
# can also be used in non-interactive environments
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_fork")

Data Parallel (``strategy="dp"``) is the only other strategy supported in interactive environments, but it is slower, discouraged by PyTorch, and has other limitations.
Among the native distributed strategies, regular DDP (``strategy="ddp"``) is still recommended as the go-to strategy over Spawn and Fork/Notebook for its speed and stability, but it can only be used with scripts.
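
For example, here is a minimal, self-contained sketch of a script-launched DDP run (the module, dataset, and device count below are hypothetical placeholders):

.. code-block:: python

    # train.py -- minimal sketch of a script-launched DDP run
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl


    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return nn.functional.cross_entropy(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    if __name__ == "__main__":
        data = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
        trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp", max_epochs=1)
        trainer.fit(LitModel(), DataLoader(data, batch_size=32))

Run it as a regular script, e.g. ``python train.py``; Lightning launches one process per GPU.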


@@ -234,107 +210,24 @@ Comparison of DDP variants and tradeoffs
- Fast


DP caveats
^^^^^^^^^^
In DP each GPU within a machine sees a portion of a batch.
It does roughly the following:

.. code-block:: python

    # illustration only: DP splits the batch and runs a copy of the model on each GPU
    def distributed_forward(batch, model):
        batch = torch.Tensor(32, 8)
        gpu_0_batch = batch[:8]
        gpu_1_batch = batch[8:16]
        gpu_2_batch = batch[16:24]
        gpu_3_batch = batch[24:]

        y_0 = model_copy_gpu_0(gpu_0_batch)
        y_1 = model_copy_gpu_1(gpu_1_batch)
        y_2 = model_copy_gpu_2(gpu_2_batch)
        y_3 = model_copy_gpu_3(gpu_3_batch)

        return [y_0, y_1, y_2, y_3]

So, when Lightning calls any of `training_step`, `validation_step`, or `test_step`,
you will only be operating on one of those pieces.

.. testcode::

# the batch here is a portion of the FULL batch
def training_step(self, batch, batch_idx):
y_0 = batch

For most metrics, this doesn't really matter. However, if you want to add something to your computational graph using
all batch parts, you can use the `training_step_end` method.

.. testcode::

    def training_step_end(self, outputs):
        # only used when training with dp
        outputs = torch.cat(outputs, dim=1)
        softmax = torch.softmax(outputs, dim=1)
        out = softmax.mean()
        return out

In pseudocode, the full sequence is:

.. code-block:: python

# get data
batch = next(dataloader)

# copy model and data to each gpu
batch_splits = split_batch(batch, num_gpus)
models = copy_model_to_gpus(model)

# in parallel, operate on each batch chunk
all_results = []
for gpu_num in gpus:
batch_split = batch_splits[gpu_num]
gpu_model = models[gpu_num]
out = gpu_model(batch_split)
all_results.append(out)

# use the full batch for something like softmax
full_out = model.training_step_end(all_results)

If `training_step_end` is defined, it will be called regardless of TPU, DP, DDP, etc., which means
it will behave the same regardless of the backend.

The validation and test steps have the same option when using DP.

.. testcode::

def validation_step_end(self, step_output):
...


def test_step_end(self, step_output):
...


Distributed and 16-bit precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Below are the possible configurations we support.

+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
| 1 GPU | 1+ GPUs | DDP | DP | 16-bit | command |
+=======+=========+=====+=====+========+=======================================================================+
| Y | | | | | `Trainer(accelerator="gpu", devices=1)` |
+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
| Y | | | | Y | `Trainer(accelerator="gpu", devices=1, precision=16)` |
+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
| | Y | Y | | | `Trainer(accelerator="gpu", devices=k, strategy='ddp')` |
+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
| | Y | Y | | Y | `Trainer(accelerator="gpu", devices=k, strategy='ddp', precision=16)` |
+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
| | Y | | Y | | `Trainer(accelerator="gpu", devices=k, strategy='dp')` |
+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+
| | Y | | Y | Y | `Trainer(accelerator="gpu", devices=k, strategy='dp', precision=16)` |
+-------+---------+-----+-----+--------+-----------------------------------------------------------------------+

DDP and DP can also be used with 1 GPU, but there's no reason to do so other than debugging distributed-related issues.
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| 1 GPU | 1+ GPUs | DDP | 16-bit | command |
+=======+=========+=====+========+=======================================================================+
| Y | | | | `Trainer(accelerator="gpu", devices=1)` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| Y | | | Y | `Trainer(accelerator="gpu", devices=1, precision=16)` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| | Y | Y | | `Trainer(accelerator="gpu", devices=k, strategy='ddp')` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+
| | Y | Y | Y | `Trainer(accelerator="gpu", devices=k, strategy='ddp', precision=16)` |
+-------+---------+-----+--------+-----------------------------------------------------------------------+

DDP can also be used with 1 GPU, but there's no reason to do so other than debugging distributed-related issues.
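
The rows of the table map directly onto ``Trainer`` arguments. For example, a minimal sketch of the multi-GPU, 16-bit row (``k`` here is a placeholder for your GPU count):

.. code-block:: python

    from pytorch_lightning import Trainer

    k = 2  # placeholder: number of available GPUs
    trainer = Trainer(accelerator="gpu", devices=k, strategy="ddp", precision=16)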


Implement Your Own Distributed (DDP) training
1 change: 0 additions & 1 deletion docs/source-pytorch/api_references.rst
@@ -216,7 +216,6 @@ strategies
ColossalAIStrategy
DDPSpawnStrategy
DDPStrategy
DataParallelStrategy
DeepSpeedStrategy
FSDPStrategy
HPUParallelStrategy
94 changes: 0 additions & 94 deletions docs/source-pytorch/common/lightning_module.rst
@@ -261,52 +261,6 @@ override the :meth:`~pytorch_lightning.LightningModule.on_training_epoch_end` me
...
self.training_step_outputs.clear() # free memory

Training with DataParallel
==========================

When training using a ``strategy`` that splits the data from each batch across GPUs, you sometimes
need to aggregate the per-GPU outputs on the main GPU for processing (DP).

In this case, implement the :meth:`~pytorch_lightning.core.module.LightningModule.training_step_end`
method, which receives the outputs from all devices so that you can accumulate them into the effective result.

.. code-block:: python

def training_step(self, batch, batch_idx):
x, y = batch
y_hat = self.model(x)
loss = F.cross_entropy(y_hat, y)
pred = ...
return {"loss": loss, "pred": pred}


def training_step_end(self, batch_parts):
# predictions from each GPU
predictions = batch_parts["pred"]
# losses from each GPU
losses = batch_parts["loss"]

gpu_0_prediction = predictions[0]
gpu_1_prediction = predictions[1]

# do something with both outputs
return (losses[0] + losses[1]) / 2


Here is the Lightning training pseudo-code for DP:

.. code-block:: python

    for batch_idx, train_batch in enumerate(train_dataloader):
        batches = split_batch(train_batch)
        dp_outs = []
        for sub_batch in batches:
            # 1. run training_step on each per-GPU sub-batch
            dp_out = training_step(sub_batch, batch_idx)
            dp_outs.append(dp_out)

        # 2. aggregate the per-GPU outputs on the main device
        training_step_end(dp_outs)

------------------

@@ -399,54 +353,6 @@ Note that this method is called before :meth:`~pytorch_lightning.LightningModule
...
self.validation_step_outputs.clear() # free memory


Validating with DataParallel
============================

When validating using a ``strategy`` that splits the data from each batch across GPUs, you sometimes
need to aggregate the per-GPU outputs on the main GPU for processing (DP).

In this case, implement the :meth:`~pytorch_lightning.core.module.LightningModule.validation_step_end`
method, which receives the outputs from all devices so that you can accumulate them into the effective result.

.. code-block:: python

def validation_step(self, batch, batch_idx):
x, y = batch
y_hat = self.model(x)
loss = F.cross_entropy(y_hat, y)
pred = ...
return {"loss": loss, "pred": pred}


def validation_step_end(self, batch_parts):
# predictions from each GPU
predictions = batch_parts["pred"]
# losses from each GPU
losses = batch_parts["loss"]

gpu_0_prediction = predictions[0]
gpu_1_prediction = predictions[1]

# do something with both outputs
return (losses[0] + losses[1]) / 2


Here is the Lightning validation pseudo-code for DP:

.. code-block:: python

    for batch_idx, batch in enumerate(dataloader):
        batches = split_batch(batch)
        dp_outs = []
        for sub_batch in batches:
            # 1. run validation_step on each per-GPU sub-batch
            dp_out = validation_step(sub_batch, batch_idx)
            dp_outs.append(dp_out)

        # 2. aggregate the per-GPU outputs on the main device
        validation_step_end(dp_outs)

----------------

*******
3 changes: 0 additions & 3 deletions docs/source-pytorch/extensions/strategy.rst
@@ -81,9 +81,6 @@ The below table lists all relevant strategies available in Lightning with their
* - ddp
- :class:`~pytorch_lightning.strategies.DDPStrategy`
- Strategy for multi-process single-device training on one or multiple nodes. :ref:`Learn more. <accelerators/gpu_intermediate:Distributed Data Parallel>`
* - dp
- :class:`~pytorch_lightning.strategies.DataParallelStrategy`
- Implements data-parallel training in a single process, i.e., the model gets replicated to each device and each gets a split of the data. :ref:`Learn more. <accelerators/gpu_intermediate:Data Parallel>`
* - deepspeed
- :class:`~pytorch_lightning.strategies.DeepSpeedStrategy`
- Provides capabilities to run training using the DeepSpeed library, with training optimizations for large billion parameter models. :ref:`Learn more. <advanced/model_parallel:deepspeed>`
15 changes: 1 addition & 14 deletions docs/source-pytorch/guides/speed.rst
@@ -49,22 +49,9 @@ GPU Training Speedup Tips
When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory efficiency, and model scaling.
Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/model_parallel>`.

Prefer DDP Over DP
^^^^^^^^^^^^^^^^^^
:class:`~pytorch_lightning.strategies.dp.DataParallelStrategy` performs three GPU transfers for EVERY batch:

1. Copy the model to the device.
2. Copy the data to the device.
3. Copy the outputs of each device back to the main device.

.. image:: https://pl-public-data.s3.amazonaws.com/docs/static/images/distributed_training/dp.gif
:alt: Animation showing DP execution.
:width: 500
:align: center

|

Whereas :class:`~pytorch_lightning.strategies.ddp.DDPStrategy` only performs two transfer operations, making DDP much faster than DP:
:class:`~pytorch_lightning.strategies.ddp.DDPStrategy` performs only two transfer operations for each step, making it the simplest distributed training strategy (see the sketch after this list):

1. Moving data to the device.
2. Transferring and syncing gradients.
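
A minimal sketch of enabling DDP (the device count is a placeholder):

.. code-block:: python

    from pytorch_lightning import Trainer

    # per training step, DDP (1) moves the batch to the local device and (2) syncs gradients
    trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp")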
3 changes: 3 additions & 0 deletions src/lightning/pytorch/CHANGELOG.md
@@ -288,6 +288,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Removed support for passing a scheduling dictionary to `Trainer(accumulate_grad_batches=...)` ([#16729](https://github.com/Lightning-AI/lightning/pull/16729))


- Removed support for `DataParallel` (`strategy='dp'`) and the `LightningParallelModule` wrapper ([#16748](https://github.com/Lightning-AI/lightning/pull/16748))


- Removed the unused `lightning.pytorch.utilities.supporters.{SharedCycleIteratorState,CombinedLoaderIterator}` classes ([#16714](https://github.com/Lightning-AI/lightning/pull/16714))

