- Bagua (``strategy='bagua'``) (multiple GPUs across many machines with advanced training algorithms)
.. note::
    If you request multiple GPUs or nodes without setting a strategy, DDP Spawn will be used automatically.
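For example, a minimal sketch of that default (``accelerator``, ``devices``, and ``strategy`` are the standard ``Trainer`` arguments; the explicit ``"ddp_spawn"`` string simply spells out the same choice):

.. code-block:: python

    from pytorch_lightning import Trainer

    # Multiple GPUs requested, no strategy set: DDP Spawn is selected automatically.
    trainer = Trainer(accelerator="gpu", devices=2)

    # Equivalent to requesting it explicitly.
    trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp_spawn")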
Bagua
^^^^^

`Bagua <https://github.com/BaguaSys/bagua>`_ is a deep learning training acceleration framework which supports
multiple advanced distributed training algorithms, including:

- `Gradient AllReduce <https://tutorials.baguasys.com/algorithms/gradient-allreduce>`_ for centralized synchronous communication, where gradients are averaged among all workers.
- `Decentralized SGD <https://tutorials.baguasys.com/algorithms/decentralized>`_ for decentralized synchronous communication, where each worker exchanges data with one or a few specific workers.
- `ByteGrad <https://tutorials.baguasys.com/algorithms/bytegrad>`_ and `QAdam <https://tutorials.baguasys.com/algorithms/q-adam>`_ for low precision communication, where data is compressed into low precision before communication.
- `Asynchronous Model Average <https://tutorials.baguasys.com/algorithms/async-model-average>`_ for asynchronous communication, where workers are not required to be synchronized in the same iteration in a lock-step style.
By default, Bagua uses the *Gradient AllReduce* algorithm, which is also the algorithm implemented in DDP,
but Bagua can usually produce higher training throughput due to its backend written in Rust.
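A minimal usage sketch follows: ``strategy="bagua"`` matches the flag listed above, while constructing a ``BaguaStrategy`` to select one of the other algorithms is an assumption about the integration's API rather than something stated on this page.

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.strategies import BaguaStrategy  # assumed import path

    # Train on 4 GPUs using Bagua's default Gradient AllReduce algorithm.
    trainer = Trainer(accelerator="gpu", devices=4, strategy="bagua")

    # Assumed API: pass a BaguaStrategy instance to pick another algorithm,
    # e.g. asynchronous model averaging.
    trainer = Trainer(accelerator="gpu", devices=4, strategy=BaguaStrategy(algorithm="async"))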
- Strategy for training using the Bagua library, with advanced distributed training algorithms and system optimizations. :ref:`Learn more. <accelerators/gpu_intermediate:Bagua>`
- Colossal-AI provides a collection of parallel components and aims to let you write distributed deep learning models the same way you write a model on your laptop. `Learn more. <https://www.colossalai.org/>`__
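If the Colossal-AI integration is installed, selecting it presumably follows the same ``strategy`` flag pattern; the ``"colossalai"`` name in this sketch is an assumption, not something confirmed by the description above.

.. code-block:: python

    from pytorch_lightning import Trainer

    # Assumed strategy name; requires the Colossal-AI integration to be installed.
    trainer = Trainer(accelerator="gpu", devices=4, strategy="colossalai")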