
Commit c407441

Remove the BaguaStrategy (#16746)
* remove bagua
* remove
* remove docker file entry
1 parent 3902088 commit c407441

File tree

14 files changed: +0 -656 lines changed


.azure/gpu-tests-pytorch.yml

Lines changed: 0 additions & 2 deletions
@@ -112,8 +112,6 @@ jobs:
   - bash: |
       set -e
-      CUDA_VERSION_BAGUA=$(python -c "print([ver for ver in [116,113,111,102] if $CUDA_VERSION_MM >= ver][0])")
-      pip install "bagua-cuda$CUDA_VERSION_BAGUA"
       CUDA_VERSION_MM_COLOSSALAI=$(python -c "import torch ; print(''.join(map(str, torch.version.cuda)))")
       CUDA_VERSION_COLOSSALAI=$(python -c "print([ver for ver in [11.3, 11.1] if $CUDA_VERSION_MM_COLOSSALAI >= ver][0])")
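
The deleted CI step derived the Bagua wheel suffix from the CUDA version with a Python one-liner. As a minimal standalone sketch of that selection logic (the function name here is illustrative, not part of the pipeline), the expression returns the first entry of [116, 113, 111, 102] that the major+minor CUDA number is greater than or equal to:

    # Sketch only: mirrors the list comprehension in the deleted line above.
    def pick_bagua_cuda(cuda_version_mm: int) -> int:
        return [ver for ver in [116, 113, 111, 102] if cuda_version_mm >= ver][0]

    assert pick_bagua_cuda(117) == 116  # CUDA 11.7 -> "bagua-cuda116"
    assert pick_bagua_cuda(113) == 113  # CUDA 11.3 -> "bagua-cuda113"
    assert pick_bagua_cuda(102) == 102  # CUDA 10.2 -> "bagua-cuda102"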

dockers/base-cuda/Dockerfile

Lines changed: 0 additions & 13 deletions
@@ -98,19 +98,6 @@ RUN \
     pip install -r requirements/pytorch/base.txt --no-cache-dir --find-links https://download.pytorch.org/whl/cu${CUDA_VERSION_MM}/torch_stable.html && \
     rm assistant.py

-
-RUN \
-    # install Bagua
-    if [[ $PYTORCH_VERSION != "1.13" ]]; then \
-    CUDA_VERSION_MM=$(python -c "print(''.join('$CUDA_VERSION'.split('.')[:2]))") ; \
-    CUDA_VERSION_BAGUA=$(python -c "print([ver for ver in [116,113,111,102] if $CUDA_VERSION_MM >= ver][0])") ; \
-    pip install "bagua-cuda$CUDA_VERSION_BAGUA" ; \
-    if [[ "$CUDA_VERSION_MM" = "$CUDA_VERSION_BAGUA" ]]; then \
-    python -c "import bagua_core; bagua_core.install_deps()"; \
-    fi ; \
-    python -c "import bagua; print(bagua.__version__)"; \
-    fi
-
 RUN \
     # install ColossalAI
     # TODO: 1.13 wheels are not released, remove skip once they are
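
The deleted layer first reduces $CUDA_VERSION to its major+minor digits and only runs bagua_core.install_deps() when that value exactly matches the selected wheel suffix. A minimal sketch of the string transform, with an illustrative helper name that is not part of the image build:

    # Sketch only: "11.7.1" -> "117", as computed in the deleted RUN block.
    def cuda_mm(cuda_version: str) -> str:
        return "".join(cuda_version.split(".")[:2])

    print(cuda_mm("11.7.1"))  # "117": nearest wheel is bagua-cuda116, no exact match, so install_deps() is skipped
    print(cuda_mm("11.3.1"))  # "113": exact match with bagua-cuda113, so install_deps() would also run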

docs/source-pytorch/accelerators/gpu_intermediate.rst

Lines changed: 0 additions & 114 deletions
@@ -25,7 +25,6 @@ Lightning supports multiple ways of doing distributed training.
 - Regular (``strategy='ddp'``)
 - Spawn (``strategy='ddp_spawn'``)
 - Notebook/Fork (``strategy='ddp_notebook'``)
-- Bagua (``strategy='bagua'``) (multiple-gpus across many machines with advanced training algorithms)

 .. note::
     If you request multiple GPUs or nodes without setting a mode, DDP Spawn will be automatically used.
@@ -235,119 +234,6 @@ Comparison of DDP variants and tradeoffs
   - Fast


-Bagua
-^^^^^
-`Bagua <https://github.com/BaguaSys/bagua>`_ is a deep learning training acceleration framework which supports
-multiple advanced distributed training algorithms including:
-
-- `Gradient AllReduce <https://tutorials.baguasys.com/algorithms/gradient-allreduce>`_ for centralized synchronous communication, where gradients are averaged among all workers.
-- `Decentralized SGD <https://tutorials.baguasys.com/algorithms/decentralized>`_ for decentralized synchronous communication, where each worker exchanges data with one or a few specific workers.
-- `ByteGrad <https://tutorials.baguasys.com/algorithms/bytegrad>`_ and `QAdam <https://tutorials.baguasys.com/algorithms/q-adam>`_ for low precision communication, where data is compressed into low precision before communication.
-- `Asynchronous Model Average <https://tutorials.baguasys.com/algorithms/async-model-average>`_ for asynchronous communication, where workers are not required to be synchronized in the same iteration in a lock-step style.
-
-By default, Bagua uses *Gradient AllReduce* algorithm, which is also the algorithm implemented in DDP,
-but Bagua can usually produce a higher training throughput due to its backend written in Rust.
-
-.. code-block:: python
-
-    # train on 4 GPUs (using Bagua mode)
-    trainer = Trainer(strategy="bagua", accelerator="gpu", devices=4)
-
-
-By specifying the ``algorithm`` in the ``BaguaStrategy``, you can select more advanced training algorithms featured by Bagua:
-
-
-.. code-block:: python
-
-    # train on 4 GPUs, using Bagua Gradient AllReduce algorithm
-    trainer = Trainer(
-        strategy=BaguaStrategy(algorithm="gradient_allreduce"),
-        accelerator="gpu",
-        devices=4,
-    )
-
-    # train on 4 GPUs, using Bagua ByteGrad algorithm
-    trainer = Trainer(
-        strategy=BaguaStrategy(algorithm="bytegrad"),
-        accelerator="gpu",
-        devices=4,
-    )
-
-    # train on 4 GPUs, using Bagua Decentralized SGD
-    trainer = Trainer(
-        strategy=BaguaStrategy(algorithm="decentralized"),
-        accelerator="gpu",
-        devices=4,
-    )
-
-    # train on 4 GPUs, using Bagua Low Precision Decentralized SGD
-    trainer = Trainer(
-        strategy=BaguaStrategy(algorithm="low_precision_decentralized"),
-        accelerator="gpu",
-        devices=4,
-    )
-
-    # train on 4 GPUs, using Asynchronous Model Average algorithm, with a synchronization interval of 100ms
-    trainer = Trainer(
-        strategy=BaguaStrategy(algorithm="async", sync_interval_ms=100),
-        accelerator="gpu",
-        devices=4,
-    )
-
-To use *QAdam*, we need to initialize
-`QAdamOptimizer <https://bagua.readthedocs.io/en/latest/autoapi/bagua/torch_api/algorithms/q_adam/index.html#bagua.torch_api.algorithms.q_adam.QAdamOptimizer>`_ first:
-
-.. code-block:: python
-
-    from pytorch_lightning.strategies import BaguaStrategy
-    from bagua.torch_api.algorithms.q_adam import QAdamOptimizer
-
-
-    class MyModel(pl.LightningModule):
-        ...
-
-        def configure_optimizers(self):
-            # initialize QAdam Optimizer
-            return QAdamOptimizer(self.parameters(), lr=0.05, warmup_steps=100)
-
-
-    model = MyModel()
-    trainer = Trainer(
-        accelerator="gpu",
-        devices=4,
-        strategy=BaguaStrategy(algorithm="qadam"),
-    )
-    trainer.fit(model)
-
-Bagua relies on its own `launcher <https://tutorials.baguasys.com/getting-started/#launch-job>`_ to schedule jobs.
-Below, find examples using ``bagua.distributed.launch``, which follows the ``torch.distributed.launch`` API:
-
-.. code-block:: bash
-
-    # start training with 8 GPUs on a single node
-    python -m bagua.distributed.launch --nproc_per_node=8 train.py
-
-If the ssh service is available with passwordless login on each node, you can launch the distributed job on a
-single node with ``baguarun``, which has a syntax similar to ``mpirun``. When starting the job, ``baguarun`` will
-automatically spawn new processes on each of the training nodes provided by the ``--host_list`` option; each node in it
-is described as an IP address followed by an SSH port.
-
-.. code-block:: bash
-
-    # Run on node1 (or node2) to start training on two nodes (node1 and node2), 8 GPUs per node
-    baguarun --host_list hostname1:ssh_port1,hostname2:ssh_port2 --nproc_per_node=8 --master_port=port1 train.py
-
-
-.. note:: You can also start training in the same way as Distributed Data Parallel. However, system optimizations like
-    `Bagua-Net <https://tutorials.baguasys.com/more-optimizations/bagua-net>`_ and
-    `Performance autotuning <https://tutorials.baguasys.com/performance-autotuning/>`_ can only be enabled through the Bagua
-    launcher. It is worth noting that with ``Bagua-Net``, Distributed Data Parallel can also achieve
-    better performance without modifying the training script.
-
-
-See `Bagua Tutorials <https://tutorials.baguasys.com/>`_ for more details on installation and advanced features.
-
-
 DP caveats
 ^^^^^^^^^^
 In DP each GPU within a machine sees a portion of a batch.
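
With this section removed, the DDP variants listed at the top of the page remain the documented multi-GPU path. As a minimal, illustrative sketch (not part of the diff) of the closest equivalent launch, plain DDP also performs gradient all-reduce, as the removed paragraph notes:

    from pytorch_lightning import Trainer

    # train on 4 GPUs with the built-in DDP strategy
    trainer = Trainer(strategy="ddp", accelerator="gpu", devices=4)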

docs/source-pytorch/api_references.rst

Lines changed: 0 additions & 1 deletion
@@ -213,7 +213,6 @@ strategies
     :nosignatures:
     :template: classtemplate.rst

-    BaguaStrategy
     ColossalAIStrategy
     DDPSpawnStrategy
     DDPStrategy

docs/source-pytorch/extensions/strategy.rst

Lines changed: 0 additions & 3 deletions
@@ -69,9 +69,6 @@ The below table lists all relevant strategies available in Lightning with their
    * - Name
      - Class
      - Description
-   * - bagua
-     - :class:`~pytorch_lightning.strategies.BaguaStrategy`
-     - Strategy for training using the Bagua library, with advanced distributed training algorithms and system optimizations. :ref:`Learn more. <accelerators/gpu_intermediate:Bagua>`
    * - colossalai
      - :class:`~pytorch_lightning.strategies.ColossalAIStrategy`
      - Colossal-AI provides a collection of parallel components for you. It aims to support you to write your distributed deep learning models just like how you write your model on your laptop. `Learn more. <https://www.colossalai.org/>`__
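
The table pairs a registered name string with a strategy class; either form can be passed to the Trainer, with the class form used when constructor arguments are needed. A minimal sketch using a strategy that remains after this change (illustrative only, not part of the diff):

    from pytorch_lightning import Trainer
    from pytorch_lightning.strategies import DDPStrategy

    # select by registered name
    trainer = Trainer(strategy="ddp", accelerator="gpu", devices=2)

    # or by class, to pass constructor arguments
    trainer = Trainer(strategy=DDPStrategy(find_unused_parameters=False), accelerator="gpu", devices=2)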
Lines changed: 0 additions & 1 deletion
@@ -1,3 +1,2 @@
 if __name__ == "__main__":
-    import bagua  # noqa: F401
     import deepspeed  # noqa: F401

src/lightning/pytorch/plugins/environments/__init__.py

Lines changed: 0 additions & 1 deletion
@@ -21,4 +21,3 @@
     TorchElasticEnvironment,
     XLAEnvironment,
 )
-from lightning.pytorch.plugins.environments.bagua_environment import BaguaEnvironment  # noqa: F401

src/lightning/pytorch/plugins/environments/bagua_environment.py

Lines changed: 0 additions & 62 deletions
This file was deleted.

src/lightning/pytorch/strategies/__init__.py

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from lightning.fabric.strategies.registry import _StrategyRegistry
-from lightning.pytorch.strategies.bagua import BaguaStrategy  # noqa: F401
 from lightning.pytorch.strategies.colossalai import ColossalAIStrategy  # noqa: F401
 from lightning.pytorch.strategies.ddp import DDPStrategy  # noqa: F401
 from lightning.pytorch.strategies.ddp_spawn import DDPSpawnStrategy  # noqa: F401

0 commit comments
