[App] Fixed Multi Node and add examples #15557
Merged
Changes from all commits (36):
- 3e187d1 update (tchaton)
- 092b36a update (tchaton)
- fcb2ea2 update (tchaton)
- 6481338 Merge branch 'master' into add_multi_node_examples (tchaton)
- e1271ce update (tchaton)
- 478d0f0 Merge branch 'add_multi_node_examples' of https://github.com/Lightnin… (tchaton)
- 0c5a079 update (tchaton)
- 804c5cb update (tchaton)
- baf1cae update (tchaton)
- a393f58 update (tchaton)
- 38f1c72 update (tchaton)
- 4ddb3ae update (tchaton)
- ed93320 update (tchaton)
- 402b6fd update (tchaton)
- dece823 update (tchaton)
- 4b7e8af update (tchaton)
- db336d3 update (tchaton)
- 2cd0d54 update (tchaton)
- 7da57cd update (tchaton)
- 651590e update (tchaton)
- 17ac6db update (tchaton)
- 589ff92 update (tchaton)
- d221d35 update (tchaton)
- 53597c7 Merge branch 'master' into add_multi_node_examples (tchaton)
- fa6def5 update (tchaton)
- 0adcdf3 Merge branch 'add_multi_node_examples' of https://github.com/Lightnin… (tchaton)
- f2fa720 update (tchaton)
- 7778c58 update (tchaton)
- c005373 update (tchaton)
- f8fda2e update (tchaton)
- 00119f6 update (tchaton)
- 0f4e5e5 update (tchaton)
- 1706574 update (tchaton)
- 060f726 update (tchaton)
- e9c4332 Merge branch 'master' into add_multi_node_examples (lantiga)
- 8323b04 update (tchaton)
New file (+41 lines), a README for the multi-node examples:
# Lightning & Multi Node Training

Lightning makes multi-node training simple by providing an interface to orchestrate compute and data.

## Multi Node with raw PyTorch

You can run multi-node raw PyTorch with the following command:

```bash
lightning run app app_torch_work.py
```

## Multi Node with raw PyTorch + Lite

You can run multi-node raw PyTorch with Lite with the following command:

```bash
lightning run app app_lite_work.py
```

## Multi Node with PyTorch Lightning

Lightning supports running PyTorch Lightning from a script or within a Lightning Work.

### Multi Node PyTorch Lightning Script

```bash
lightning run app app_pl_script.py
```

### Multi Node PyTorch Lightning Work

```bash
lightning run app app_pl_work.py
```

## Multi Node with any framework

```bash
lightning run app app_generic_work.py
```
New file (+59 lines), the raw PyTorch + Lite example:
```python
import os

import torch

import lightning as L
from lightning.app.components import MultiNode
from lightning.lite import LightningLite


def distributed_train(lite: LightningLite):
    # 1. Prepare the distributed model and optimizer.
    model = torch.nn.Linear(32, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    model, optimizer = lite.setup(model, optimizer)
    criterion = torch.nn.MSELoss()

    # 2. Train the model for 50 steps.
    for step in range(50):
        model.zero_grad()
        x = torch.randn(64, 32).to(lite.device)
        output = model(x)
        loss = criterion(output, torch.ones_like(output))
        print(f"global_rank: {lite.global_rank} step: {step} loss: {loss}")
        lite.backward(loss)
        optimizer.step()

    # 3. Verify all processes have the same weights at the end of training.
    weight = model.module.weight.clone()
    torch.distributed.all_reduce(weight)
    assert torch.equal(model.module.weight, weight / lite.world_size)

    print("Multi Node Distributed Training Done!")


class PyTorchDistributed(L.LightningWork):
    def run(
        self,
        main_address: str,
        main_port: int,
        num_nodes: int,
        node_rank: int,
    ):
        os.environ["MASTER_ADDR"] = main_address
        os.environ["MASTER_PORT"] = str(main_port)
        os.environ["NODE_RANK"] = str(node_rank)

        lite = LightningLite(accelerator="auto", devices="auto", strategy="ddp_spawn", num_nodes=num_nodes)
        lite.launch(function=distributed_train)


compute = L.CloudCompute("gpu-fast-multi")  # 4xV100
app = L.LightningApp(
    MultiNode(
        PyTorchDistributed,
        num_nodes=2,
        cloud_compute=compute,
    )
)
```
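Per the README above, this example runs with `lightning run app app_lite_work.py` (assuming this file is the `app_lite_work.py` the README refers to; the diff view does not show filenames). With the `ddp_spawn` strategy, `lite.launch(function=distributed_train)` spawns one process per device on each node, so `distributed_train` executes across the full `num_nodes * devices` world size.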
File renamed without changes.
New file (+38 lines), the PyTorch Lightning work example:
```python
import os

import lightning as L
from lightning.app.components import MultiNode
from lightning.pytorch.demos.boring_classes import BoringModel


class PyTorchLightningDistributed(L.LightningWork):
    def run(
        self,
        main_address: str,
        main_port: int,
        num_nodes: int,
        node_rank: int,
    ):
        os.environ["MASTER_ADDR"] = main_address
        os.environ["MASTER_PORT"] = str(main_port)
        os.environ["NODE_RANK"] = str(node_rank)

        model = BoringModel()
        trainer = L.Trainer(
            max_epochs=10,
            devices="auto",
            accelerator="auto",
            num_nodes=num_nodes,
            strategy="ddp_spawn",  # Only spawn-based strategies are supported for now.
        )
        trainer.fit(model)


compute = L.CloudCompute("gpu-fast-multi")  # 4xV100
app = L.LightningApp(
    MultiNode(
        PyTorchLightningDistributed,
        num_nodes=2,
        cloud_compute=compute,
    )
)
```
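Note the inline comment: only spawn-based strategies are supported here, presumably because the Work's `run()` already executes inside a live process on each node, so the Trainer must fork its per-device workers itself rather than re-launching a training script. With `num_nodes=2` on `gpu-fast-multi` (4xV100) and `devices="auto"`, the effective world size should be 8.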
New file (+70 lines), the raw PyTorch example:
```python
import torch
from torch.nn.parallel.distributed import DistributedDataParallel

import lightning as L
from lightning.app.components import MultiNode


def distributed_train(local_rank: int, main_address: str, main_port: int, num_nodes: int, node_rank: int, nprocs: int):
    # 1. Set up the distributed environment.
    global_rank = local_rank + node_rank * nprocs
    world_size = num_nodes * nprocs

    if torch.distributed.is_available() and not torch.distributed.is_initialized():
        torch.distributed.init_process_group(
            "nccl" if torch.cuda.is_available() else "gloo",
            rank=global_rank,
            world_size=world_size,
            init_method=f"tcp://{main_address}:{main_port}",
        )

    # 2. Prepare the distributed model. DDP expects `device_ids` to be a list,
    # and the model must already be on its target device when wrapped.
    model = torch.nn.Linear(32, 2)
    device = torch.device(f"cuda:{local_rank}") if torch.cuda.is_available() else torch.device("cpu")
    device_ids = [local_rank] if torch.cuda.is_available() else None
    model = DistributedDataParallel(model.to(device), device_ids=device_ids)

    # 3. Prepare the loss and optimizer.
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # 4. Train the model for 50 steps.
    for step in range(50):
        model.zero_grad()
        x = torch.randn(64, 32).to(device)
        output = model(x)
        loss = criterion(output, torch.ones_like(output))
        print(f"global_rank: {global_rank} step: {step} loss: {loss}")
        loss.backward()
        optimizer.step()

    # 5. Verify all processes have the same weights at the end of training.
    weight = model.module.weight.clone()
    torch.distributed.all_reduce(weight)
    assert torch.equal(model.module.weight, weight / world_size)

    print("Multi Node Distributed Training Done!")


class PyTorchDistributed(L.LightningWork):
    def run(
        self,
        main_address: str,
        main_port: int,
        num_nodes: int,
        node_rank: int,
    ):
        nprocs = torch.cuda.device_count() if torch.cuda.is_available() else 1
        torch.multiprocessing.spawn(
            distributed_train, args=(main_address, main_port, num_nodes, node_rank, nprocs), nprocs=nprocs
        )


compute = L.CloudCompute("gpu-fast-multi")  # 4xV100
app = L.LightningApp(
    MultiNode(
        PyTorchDistributed,
        num_nodes=2,
        cloud_compute=compute,
    )
)
```
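The rank arithmetic is worth spelling out: with `num_nodes=2` and `nprocs=4` (one process per V100), node 0's local ranks 0-3 map to global ranks 0-3, node 1's local ranks 0-3 map to global ranks 4-7, and `world_size` is 8. The final check works because `all_reduce` defaults to a sum: only if every rank holds identical weights does the summed weight divided by `world_size` equal each rank's own copy.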
This file was deleted.
This file was deleted.
File renamed without changes.
```diff
@@ -1,7 +1,7 @@
-from pytorch_lightning import Trainer
-from pytorch_lightning.demos.boring_classes import BoringModel
+import lightning as L
+from lightning.pytorch.demos.boring_classes import BoringModel
 
 if __name__ == "__main__":
     model = BoringModel()
-    trainer = Trainer(max_epochs=1)
+    trainer = L.Trainer(max_epochs=1)
     trainer.fit(model)
```
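This change migrates the script from the standalone `pytorch_lightning` package to the unified `lightning` namespace (`lightning.pytorch`), matching the imports used by the new example apps above.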
(Several more changed files failed to render in this view.)