python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'horovod' not in line] ; open(fname, 'w').writelines(lines)"
74
74
python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'bagua' not in line] ; open(fname, 'w').writelines(lines)"
In addition to this, the following environment variables need to be set to establish communication across nodes. Check out the documentation on :doc:`Cluster Environment <../clouds/cluster>` for more details.

- *MASTER_PORT* - required; has to be a free port on the machine with NODE_RANK 0
- *MASTER_ADDR* - required (except for NODE_RANK 0); address of the NODE_RANK 0 node
- *WORLD_SIZE* - required; how many workers are in the cluster
- *NODE_RANK* - required; id of the node in the cluster

The trainer needs to be instantiated on every node participating in the training.
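For illustration, here is a minimal sketch of how these variables could be set on each node before the Trainer starts; the address, port, and cluster size below are placeholder values for a hypothetical two-node, four-GPU-per-node setup.

.. code-block:: python

    # Hypothetical per-node launch snippet; all values are placeholders.
    import os

    os.environ["MASTER_ADDR"] = "10.0.0.1"  # address of the NODE_RANK 0 machine
    os.environ["MASTER_PORT"] = "29500"     # a free port on the NODE_RANK 0 machine
    os.environ["WORLD_SIZE"] = "8"          # total workers: 2 nodes x 4 GPUs each
    os.environ["NODE_RANK"] = "0"           # 0 on the first node, 1 on the second

    # The same Trainer invocation then runs on every node, for example:
    # Trainer(accelerator="gpu", devices=4, num_nodes=2, strategy="ddp").fit(model)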
docs/source-pytorch/advanced/model_parallel.rst (23 additions, 6 deletions)
@@ -212,14 +212,31 @@ PyTorch Fully Sharded Training
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 PyTorch has its own version of `FSDP <https://pytorch.org/docs/stable/fsdp.html>`_ which is upstreamed from their `fairscale <https://fairscale.readthedocs.io/en/latest/api/nn/fsdp.html>`__ project.
-It was introduced in their `v1.11.0 release <https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/>`_. The API is pretty similar to that of FairScale.
+It was introduced in their `v1.11.0 release <https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/>`_, but it is recommended to use it with PyTorch v1.12 or later, which is what
+Lightning supports. The API is pretty similar to that of FairScale.

-.. note::
-    Currently Fully Sharded Training relies on the user to wrap the model with Fully Sharded within the ``LightningModule``.
-    This means you must create a single model that is treated as a ``torch.nn.Module`` within the ``LightningModule``.
-    This is a limitation of Fully Sharded Training that will be resolved in the future.
-
-To activate parameter sharding, you must wrap your model using the ``wrap`` function. Internally in Lightning, we enable a context manager around the ``configure_sharded_model`` function to make sure the ``wrap`` parameters are passed correctly.
+Auto Wrapping
+"""""""""""""
+Model layers should be wrapped in FSDP in a nested way to save peak memory and enable communication and computation overlapping. The
+simplest way to do it is auto wrapping, which can serve as a drop-in replacement for DDP without changing the rest of the code. You don't
+have to ``wrap`` layers manually as in the case of manual wrapping.
+Read more `here <https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/#auto-wrapping>`__.
+
+Manual Wrapping
+"""""""""""""""
+
+Manual wrapping can be useful to explore complex sharding strategies by applying ``wrap`` selectively to some parts of the model. To activate
+parameter sharding with manual wrapping, you can wrap your model using the ``wrap`` function. Internally in Lightning, we enable a context manager around the ``configure_sharded_model`` function to make sure the ``wrap`` parameters are passed correctly.

 When not using Fully Sharded these wrap functions are a no-op. This means once the changes have been made, there is no need to remove the changes for other strategies.
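As a sketch of the auto-wrapping path described in the diff above (not taken verbatim from the PR): selecting the native FSDP strategy on the Trainer shards the model without any manual ``wrap`` calls. The strategy name ``fsdp_native`` and the toy model are assumptions tied to the Lightning release this PR targets.

.. code-block:: python

    # Sketch: auto-wrapped native FSDP as a drop-in replacement for DDP.
    # ``strategy="fsdp_native"`` is assumed; check your Lightning version for
    # the exact strategy name.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from pytorch_lightning import LightningModule, Trainer


    class ToyModel(LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.cross_entropy(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)


    train_data = DataLoader(
        TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))), batch_size=8
    )
    trainer = Trainer(accelerator="gpu", devices=4, strategy="fsdp_native", precision=16)
    trainer.fit(ToyModel(), train_data)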
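The manual-wrapping path can be sketched the same way, with the same caveats: ``wrap`` only has an effect inside the context manager that Lightning opens around ``configure_sharded_model``, and it is a no-op under other strategies, matching the closing sentence of the diff. The import path is the one PyTorch 1.12 exposes; the layer layout is a placeholder.

.. code-block:: python

    # Sketch: selective sharding by wrapping individual submodules inside
    # ``configure_sharded_model``; only the backbone is sharded here.
    import torch
    from torch.distributed.fsdp.wrap import wrap
    from pytorch_lightning import LightningModule


    class ManuallyWrappedModel(LightningModule):
        def __init__(self):
            super().__init__()
            self.backbone = torch.nn.Linear(32, 32)
            self.head = torch.nn.Linear(32, 2)

        def configure_sharded_model(self):
            # ``wrap`` shards the module when the native FSDP strategy is active
            # and returns it unchanged otherwise.
            self.backbone = wrap(self.backbone)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.cross_entropy(self.head(self.backbone(x)), y)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)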