Merged

82 commits
3e187d1
update
tchaton Nov 6, 2022
092b36a
update
tchaton Nov 6, 2022
fcb2ea2
update
tchaton Nov 6, 2022
6481338
Merge branch 'master' into add_multi_node_examples
tchaton Nov 6, 2022
e1271ce
update
tchaton Nov 6, 2022
478d0f0
Merge branch 'add_multi_node_examples' of https://github.com/Lightnin…
tchaton Nov 6, 2022
0c5a079
update
tchaton Nov 6, 2022
804c5cb
update
tchaton Nov 6, 2022
baf1cae
update
tchaton Nov 6, 2022
a393f58
update
tchaton Nov 6, 2022
38f1c72
update
tchaton Nov 6, 2022
4ddb3ae
update
tchaton Nov 6, 2022
ed93320
update
tchaton Nov 6, 2022
402b6fd
update
tchaton Nov 6, 2022
dece823
update
tchaton Nov 6, 2022
4b7e8af
update
tchaton Nov 6, 2022
db336d3
update
tchaton Nov 6, 2022
2cd0d54
update
tchaton Nov 6, 2022
7da57cd
update
tchaton Nov 6, 2022
651590e
update
tchaton Nov 6, 2022
17ac6db
update
tchaton Nov 6, 2022
589ff92
update
tchaton Nov 6, 2022
d221d35
update
tchaton Nov 6, 2022
53597c7
Merge branch 'master' into add_multi_node_examples
tchaton Nov 6, 2022
fa6def5
update
tchaton Nov 6, 2022
0adcdf3
Merge branch 'add_multi_node_examples' of https://github.com/Lightnin…
tchaton Nov 6, 2022
f2fa720
update
tchaton Nov 6, 2022
7778c58
update
tchaton Nov 6, 2022
c005373
update
tchaton Nov 6, 2022
f8fda2e
update
tchaton Nov 6, 2022
00119f6
update
tchaton Nov 6, 2022
0f4e5e5
update
tchaton Nov 6, 2022
9e437df
update
tchaton Nov 6, 2022
dec2def
update
tchaton Nov 6, 2022
2c4b26c
update
tchaton Nov 6, 2022
b596dbe
update
tchaton Nov 6, 2022
45fba58
update
tchaton Nov 6, 2022
8484601
update
tchaton Nov 6, 2022
45dde23
update
tchaton Nov 6, 2022
09b1b5d
update
tchaton Nov 6, 2022
0732886
update
tchaton Nov 6, 2022
6259800
update
tchaton Nov 7, 2022
f48004d
update
tchaton Nov 7, 2022
56b4bc9
update
tchaton Nov 7, 2022
e45ea15
update
tchaton Nov 7, 2022
089b677
update
tchaton Nov 7, 2022
5b14153
update
tchaton Nov 7, 2022
6433868
update
tchaton Nov 7, 2022
7ed3313
update
tchaton Nov 7, 2022
3e78c0c
update
tchaton Nov 7, 2022
7c8e82f
update
tchaton Nov 7, 2022
af7eb60
update
tchaton Nov 7, 2022
b89c47d
update
tchaton Nov 7, 2022
bef788a
update
tchaton Nov 7, 2022
61b23d9
update
tchaton Nov 7, 2022
381e013
update
tchaton Nov 7, 2022
44a1cc2
Merge branch 'master' into expose_work_runner
tchaton Nov 7, 2022
ea01249
update
tchaton Nov 7, 2022
e39d1da
Merge branch 'expose_work_runner' of https://github.com/Lightning-AI/…
tchaton Nov 7, 2022
e079633
update
tchaton Nov 7, 2022
3ea7b54
update
tchaton Nov 7, 2022
7fde3c5
update
tchaton Nov 7, 2022
a60a480
update
tchaton Nov 7, 2022
f0a402e
update
tchaton Nov 7, 2022
9921cf1
update
tchaton Nov 7, 2022
6067c9a
update
tchaton Nov 7, 2022
34d618e
update
tchaton Nov 7, 2022
2af5549
Merge branch 'master' into expose_work_runner
tchaton Nov 7, 2022
2227f35
Merge branch 'master' into expose_work_runner
tchaton Nov 7, 2022
a909169
Apply suggestions from code review
Borda Nov 7, 2022
7975fa8
update
tchaton Nov 8, 2022
512bd73
update
tchaton Nov 8, 2022
9ac00b1
Merge branch 'master' into expose_work_runner
tchaton Nov 8, 2022
86e27c9
update
tchaton Nov 8, 2022
6c03a31
update
tchaton Nov 8, 2022
ffea172
update
tchaton Nov 8, 2022
e5a4880
add note
tchaton Nov 8, 2022
8c59907
Update examples/app_multi_node/README.md
tchaton Nov 8, 2022
74dfa0f
Merge branch 'master' into expose_work_runner
tchaton Nov 8, 2022
b157a21
update
tchaton Nov 8, 2022
e1bce24
Merge branch 'expose_work_runner' of https://github.com/Lightning-AI/…
tchaton Nov 8, 2022
41b6b81
Merge branch 'master' into expose_work_runner
tchaton Nov 8, 2022
2 changes: 2 additions & 0 deletions .github/workflows/ci-app-examples.yml
@@ -100,6 +100,8 @@ jobs:
if: ${{ matrix.pkg-name != 'lightning' }}
run: |
python .actions/assistant.py copy_replace_imports --source_dir="./examples" --source_import="lightning.app,lightning" --target_import="lightning_app,lightning_app"
python .actions/assistant.py copy_replace_imports --source_dir="./examples" --source_import="lightning_app.lite" --target_import="lightning_lite"
python .actions/assistant.py copy_replace_imports --source_dir="./examples" --source_import="lightning_app.pytorch" --target_import="pytorch_lightning"

- name: Switch coverage scope
run: python -c "print('COVERAGE_SCOPE=' + str('lightning' if '${{matrix.pkg-name}}' == 'lightning' else 'lightning_app'))" >> $GITHUB_ENV
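For context, a minimal sketch of what the two new rewrite passes do to the examples (assuming `copy_replace_imports` performs a plain string substitution on import paths, as its arguments suggest):

```python
# As written in the examples (unified "lightning" package):
from lightning.lite import LightningLite

# The first pass rewrites "lightning" -> "lightning_app", yielding
# "lightning_app.lite"; the new pass then maps that onto the standalone
# package, so the CI copy of the example ends up importing:
from lightning_lite import LightningLite  # noqa: F811 (illustrative re-import)
```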
18 changes: 14 additions & 4 deletions examples/app_multi_node/README.md
@@ -6,32 +6,42 @@ Lightning makes multi-node training simple by providing a simple interface.

You can run the raw PyTorch multi-node examples by running the following commands.

Here is an example where you set up and spawn the processes yourself.

```bash
lightning run app app_torch_work.py
```

or you can use the built-in component for it.

```bash
lightning run app app_component_torch.py
```
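Either command can also be submitted to the cloud by appending the Lightning CLI's `--cloud` flag, e.g. `lightning run app app_component_torch.py --cloud`.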

## Multi Node with raw PyTorch + Lite

You can run multi-node raw PyTorch with Lite by running the following commands.

This removes all the boilerplate around the distributed strategy, while you remain in control of your loops.

```bash
lightning run app app_lite_work.py
lightning run app app_component_lite.py
```

## Multi Node with PyTorch Lightning

Lightning supports running PyTorch Lightning from a script or within a Lightning Work.

### Multi Node PyTorch Lightning Script
You can either run a script directly:

```bash
lightning run app app_pl_script.py
```

### Multi Node PyTorch Lightning Work
or run your code within a Work:

```bash
lightning run app app_pl_work.py
lightning run app app_component_pl.py
```

## Multi Node with any framework
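For reference, here is a minimal sketch of the framework-agnostic `MultiNode` component this section refers to, following the constructor shown in `multi_node/base.py` further below (the work class is illustrative):

```python
import lightning as L
from lightning.app.components import MultiNode


class AnyFrameworkWork(L.LightningWork):
    def run(self):
        # Framework-agnostic distributed setup and training go here.
        print("Running on one node of the cluster.")


app = L.LightningApp(
    MultiNode(
        AnyFrameworkWork,
        num_nodes=2,
        cloud_compute=L.CloudCompute("gpu-fast-multi"),  # 4 x V100
    )
)
```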
37 changes: 37 additions & 0 deletions examples/app_multi_node/app_component_lite.py
@@ -0,0 +1,37 @@
import torch

import lightning as L
from lightning.app.components import LiteMultiNode
from lightning.lite import LightningLite


class LitePyTorchDistributed(L.LightningWork):
@staticmethod
def run():
# 1. Create LightningLite.
lite = LightningLite(strategy="ddp", precision="bf16")

# 2. Prepare distributed model and optimizer.
model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = lite.setup(model, optimizer)
criterion = torch.nn.MSELoss()

# 3. Train the model for 50 steps.
for step in range(50):
model.zero_grad()
x = torch.randn(64, 32).to(lite.device)
output = model(x)
loss = criterion(output, torch.ones_like(output))
print(f"global_rank: {lite.global_rank} step: {step} loss: {loss}")
lite.backward(loss)
optimizer.step()


app = L.LightningApp(
LiteMultiNode(
LitePyTorchDistributed,
cloud_compute=L.CloudCompute("gpu-fast-multi"),  # 4 x V100
num_nodes=2,
)
)
24 changes: 24 additions & 0 deletions examples/app_multi_node/app_component_pl.py
@@ -0,0 +1,24 @@
import lightning as L
from lightning.app.components import PyTorchLightningMultiNode
from lightning.pytorch.demos.boring_classes import BoringModel


class PyTorchLightningDistributed(L.LightningWork):
@staticmethod
def run():
model = BoringModel()
trainer = L.Trainer(
max_epochs=10,
strategy="ddp",
)
trainer.fit(model)


compute = L.CloudCompute("gpu-fast-multi") # 4 x V100
app = L.LightningApp(
PyTorchLightningMultiNode(
PyTorchLightningDistributed,
num_nodes=2,
cloud_compute=compute,
)
)
46 changes: 46 additions & 0 deletions examples/app_multi_node/app_component_torch.py
@@ -0,0 +1,46 @@
import torch
from torch.nn.parallel.distributed import DistributedDataParallel

import lightning as L
from lightning.app.components import PyTorchSpawnMultiNode


class PyTorchDistributed(L.LightningWork):

# Note: only static methods are supported with `PyTorchSpawnMultiNode` for now
@staticmethod
def run(
world_size: int,
node_rank: int,
global_rank: int,
local_rank: int,
):
# 1. Prepare distributed model
model = torch.nn.Linear(32, 2)
device = torch.device(f"cuda:{local_rank}") if torch.cuda.is_available() else torch.device("cpu")
device_ids = [local_rank] if torch.cuda.is_available() else None
model = DistributedDataParallel(model.to(device), device_ids=device_ids)

# 2. Prepare loss and optimizer
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# 3. Train the model for 50 steps.
for step in range(50):
model.zero_grad()
x = torch.randn(64, 32).to(device)
output = model(x)
loss = criterion(output, torch.ones_like(output))
print(f"global_rank: {global_rank} step: {step} loss: {loss}")
loss.backward()
optimizer.step()


compute = L.CloudCompute("gpu-fast-multi") # 4 x V100
app = L.LightningApp(
PyTorchSpawnMultiNode(
PyTorchDistributed,
num_nodes=2,
cloud_compute=compute,
)
)
59 changes: 0 additions & 59 deletions examples/app_multi_node/app_lite_work.py

This file was deleted.

38 changes: 0 additions & 38 deletions examples/app_multi_node/app_pl_work.py

This file was deleted.

2 changes: 1 addition & 1 deletion examples/app_multi_node/app_torch_work.py
@@ -60,7 +60,7 @@ def run(
)


compute = L.CloudCompute("gpu-fast-multi") # 4xV100
compute = L.CloudCompute("gpu-fast-multi") # 4 x V100
app = L.LightningApp(
MultiNode(
PyTorchDistributed,
5 changes: 4 additions & 1 deletion pyproject.toml
@@ -59,7 +59,10 @@ warn_no_return = "False"
# the list can be generated with:
# mypy --no-error-summary 2>&1 | tr ':' ' ' | awk '{print $1}' | sort | uniq | sed 's/\.py//g; s|src/||g; s|\/|\.|g' | xargs -I {} echo '"{}",'
module = [
"lightning_app.components.multi_node",
"lightning_app.components.multi_node.lite",
"lightning_app.components.multi_node.base",
"lightning_app.components.multi_node.pytorch_spawn",
"lightning_app.components.multi_node.pl",
"lightning_app.api.http_methods",
"lightning_app.api.request_types",
"lightning_app.cli.commands.app_commands",
1 change: 1 addition & 0 deletions requirements/app/examples.txt
@@ -1 +1,2 @@
pytorch-lightning>=1.8.0
lightning_lite
2 changes: 1 addition & 1 deletion src/lightning_app/CHANGELOG.md
@@ -14,7 +14,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

- Added a `MultiNode` Component to run distributed computation with any framework ([#15524](https://github.com/Lightning-AI/lightning/pull/15524))

-
- Expose `RunWorkExecutor` to the work and provide default ones for the `MultiNode` Component ([#15561](https://github.com/Lightning-AI/lightning/pull/15561))


### Changed
10 changes: 9 additions & 1 deletion src/lightning_app/components/__init__.py
@@ -1,6 +1,11 @@
from lightning_app.components.database.client import DatabaseClient
from lightning_app.components.database.server import Database
from lightning_app.components.multi_node import MultiNode
from lightning_app.components.multi_node import (
LiteMultiNode,
MultiNode,
PyTorchLightningMultiNode,
PyTorchSpawnMultiNode,
)
from lightning_app.components.python.popen import PopenPythonScript
from lightning_app.components.python.tracer import Code, TracerPythonScript
from lightning_app.components.serve.gradio import ServeGradio
@@ -18,6 +23,9 @@
"ServeStreamlit",
"ModelInferenceAPI",
"MultiNode",
"LiteMultiNode",
"LightningTrainingComponent",
"PyTorchLightningScriptRunner",
"PyTorchSpawnMultiNode",
"PyTorchLightningMultiNode",
]
6 changes: 6 additions & 0 deletions src/lightning_app/components/multi_node/__init__.py
@@ -0,0 +1,6 @@
from lightning_app.components.multi_node.base import MultiNode
from lightning_app.components.multi_node.lite import LiteMultiNode
from lightning_app.components.multi_node.pl import PyTorchLightningMultiNode
from lightning_app.components.multi_node.pytorch_spawn import PyTorchSpawnMultiNode

__all__ = ["LiteMultiNode", "MultiNode", "PyTorchSpawnMultiNode", "PyTorchLightningMultiNode"]
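With the updated `__all__` above, the new components are importable directly from `lightning_app.components` (or `lightning.app.components` in the unified package, as the examples do):

```python
from lightning_app.components import (
    LiteMultiNode,
    MultiNode,
    PyTorchLightningMultiNode,
    PyTorchSpawnMultiNode,
)
```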
src/lightning_app/components/multi_node/base.py
@@ -1,10 +1,11 @@
from typing import Any, Type
from typing import Any, Callable, Optional, Type, Union

from lightning_app import structures
from lightning_app.core.flow import LightningFlow
from lightning_app.core.work import LightningWork
from lightning_app.utilities.enum import WorkStageStatus
from lightning_app.utilities.packaging.cloud_compute import CloudCompute
from lightning_app.utilities.proxies import WorkRunExecutor


class MultiNode(LightningFlow):
@@ -13,6 +14,7 @@ def __init__(
work_cls: Type["LightningWork"],
num_nodes: int,
cloud_compute: "CloudCompute",
executor_cls: Optional[Union[Type[WorkRunExecutor], Callable]] = None,
*work_args: Any,
**work_kwargs: Any,
) -> None:
@@ -48,6 +50,7 @@ def run(
work_cls: The work to be executed
num_nodes: Number of nodes.
cloud_compute: The cloud compute object used in the cloud.
executor_cls: Customize the work run method execution.
work_args: Arguments to be provided to the work on instantiation.
work_kwargs: Keywords arguments to be provided to the work on instantiation.
"""
@@ -58,6 +61,10 @@
self._cloud_compute = cloud_compute
self._work_args = work_args
self._work_kwargs = work_kwargs

if executor_cls:
self._work_kwargs["run_executor_cls"] = executor_cls

self.has_started = False

def run(self) -> None:
@@ -74,6 +81,7 @@ def run(self) -> None:
parallel=True,
)
)

# Starting node `node_rank` ...
self.ws[-1].start()

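Since `executor_cls` is forwarded to the work as the `run_executor_cls` kwarg (see the `__init__` change above), a custom executor can hook into how each node's `run` is invoked. A minimal sketch, assuming `WorkRunExecutor` is invoked via `__call__` (the exact interface lives in `lightning_app.utilities.proxies` and may differ):

```python
from lightning_app.utilities.proxies import WorkRunExecutor


class LoggingWorkRunExecutor(WorkRunExecutor):
    """Hypothetical executor that logs around the work's run method."""

    def __call__(self, *args, **kwargs):
        print("Starting the work's run method.")
        result = super().__call__(*args, **kwargs)
        print("Finished the work's run method.")
        return result


# Passed through the MultiNode constructor shown above, e.g.:
# MultiNode(MyWork, num_nodes=2, cloud_compute=L.CloudCompute("gpu-fast-multi"),
#           executor_cls=LoggingWorkRunExecutor)
```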