[App] Expose Run Work Executor #15561

tchaton · 2022-11-06T21:33:56Z

What does this PR do?

This PR exposes the RunWorkExecutor and provides default executors for the MultiNode component.

Example with PyTorch. Processes are spawned automatically for the user.

import torch
from torch.nn.parallel.distributed import DistributedDataParallel

import lightning as L
from lightning.app.components import PyTorchSpawnMultiNode


class PyTorchDistributed(L.LightningWork):

    # Note: Only static methods are supported for now with `PyTorchSpawnMultiNode`
    @staticmethod
    def run(
        world_size: int,
        node_rank: int,
        global_rank: int,
        local_rank: int,
    ):
        # 1. Prepare distributed model
        device = torch.device(f"cuda:{local_rank}") if torch.cuda.is_available() else torch.device("cpu")
        # Move the module to its device before wrapping it in DDP.
        model = torch.nn.Linear(32, 2).to(device)
        # `device_ids` expects a list of devices (or None on CPU).
        device_ids = [local_rank] if torch.cuda.is_available() else None
        model = DistributedDataParallel(model, device_ids=device_ids)

        # 2. Prepare loss and optimizer
        criterion = torch.nn.MSELoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        # 3. Train the model for 50 steps.
        for step in range(50):
            model.zero_grad()
            x = torch.randn(64, 32).to(device)
            output = model(x)
            loss = criterion(output, torch.ones_like(output))
            print(f"global_rank: {global_rank} step: {step} loss: {loss}")
            loss.backward()
            optimizer.step()


compute = L.CloudCompute("gpu-fast-multi")  # 4 x V100
app = L.LightningApp(
    PyTorchSpawnMultiNode(
        PyTorchDistributed,
        num_nodes=2,
        cloud_compute=compute,
    )
)
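
The four arguments passed to run in the example above are related by a simple convention. The sketch below is not taken from this PR: it assumes the usual layout where world_size = num_nodes * procs_per_node and global_rank = node_rank * procs_per_node + local_rank. The names RankInfo and rank_layout are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RankInfo:
    """Hypothetical container mirroring the arguments of `run` above."""

    world_size: int
    node_rank: int
    global_rank: int
    local_rank: int


def rank_layout(num_nodes: int, procs_per_node: int) -> List[RankInfo]:
    """Enumerate one RankInfo per process, assuming a dense node-major layout."""
    world_size = num_nodes * procs_per_node
    return [
        RankInfo(
            world_size=world_size,
            node_rank=node_rank,
            # Assumed convention: ranks are numbered node by node.
            global_rank=node_rank * procs_per_node + local_rank,
            local_rank=local_rank,
        )
        for node_rank in range(num_nodes)
        for local_rank in range(procs_per_node)
    ]


# 2 nodes with 4 processes each, matching num_nodes=2 on 4-GPU machines.
layout = rank_layout(num_nodes=2, procs_per_node=4)
for info in layout:
    print(info)
```

Under this assumption, the second process on the second node would receive world_size=8, node_rank=1, global_rank=5, local_rank=1.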

Example with LightningLite.

import torch

import lightning as L
from lightning.app.components import LiteMultiNode
from lightning.lite import LightningLite


class LitePyTorchDistributed(L.LightningWork):
    @staticmethod
    def run():
        # 1. Create LightningLite.
        lite = LightningLite(strategy="ddp", precision="bf16")

        # 2. Prepare distributed model and optimizer.
        model = torch.nn.Linear(32, 2)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        model, optimizer = lite.setup(model, optimizer)
        criterion = torch.nn.MSELoss()

        # 3. Train the model for 50 steps.
        for step in range(50):
            model.zero_grad()
            x = torch.randn(64, 32).to(lite.device)
            output = model(x)
            loss = criterion(output, torch.ones_like(output))
            print(f"global_rank: {lite.global_rank} step: {step} loss: {loss}")
            lite.backward(loss)
            optimizer.step()


app = L.LightningApp(
    LiteMultiNode(
        LitePyTorchDistributed,
        cloud_compute=L.CloudCompute("gpu-fast-multi"),  # 4 x V100
        num_nodes=2,
    )
)
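
In the Lite example, run takes no rank arguments, so the cluster details have to reach LightningLite some other way. One common mechanism, assumed here rather than confirmed by this PR, is the set of environment variables read by torch.distributed's env:// initialization (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK). The helper name dist_env and the address used below are hypothetical.

```python
from typing import Dict


def dist_env(main_address: str, main_port: int, world_size: int, global_rank: int) -> Dict[str, str]:
    """Build the env:// variables for one process (illustrative helper, not a Lightning API)."""
    if not 0 <= global_rank < world_size:
        raise ValueError(f"global_rank {global_rank} is outside [0, {world_size})")
    # Environment variables are always strings, so numbers are converted explicitly.
    return {
        "MASTER_ADDR": main_address,
        "MASTER_PORT": str(main_port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(global_rank),
    }


# Hypothetical values: 2 nodes x 4 processes, sixth process overall.
env = dist_env("10.0.0.1", 29500, world_size=8, global_rank=5)
print(env)
```

An executor could merge such a dictionary into os.environ before calling run, which would let the user code stay free of rank bookkeeping.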

Example with PyTorch Lightning.

import lightning as L
from lightning.app.components import PyTorchLightningMultiNode
from lightning.pytorch.demos.boring_classes import BoringModel


class PyTorchLightningDistributed(L.LightningWork):
    @staticmethod
    def run():
        model = BoringModel()
        trainer = L.Trainer(
            max_epochs=10,
            strategy="ddp",
        )
        trainer.fit(model)


compute = L.CloudCompute("gpu-fast-multi")  # 4 x V100
app = L.LightningApp(
    PyTorchLightningMultiNode(
        PyTorchLightningDistributed,
        num_nodes=2,
        cloud_compute=compute,
    )
)
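
The three examples above share one idea: the component decides how run is executed, while the user only writes run. The sketch below illustrates that executor pattern in plain Python. It is not the PR's implementation; DirectExecutor and FanOutExecutor are hypothetical names, and threads stand in for the spawned processes.

```python
from concurrent.futures import ThreadPoolExecutor


class DirectExecutor:
    """Hypothetical base executor: wraps a run callable and calls it as-is."""

    def __init__(self, run_fn):
        self.run_fn = run_fn

    def __call__(self, *args, **kwargs):
        return self.run_fn(*args, **kwargs)


class FanOutExecutor(DirectExecutor):
    """Hypothetical executor that calls run_fn once per worker slot.

    Threads are used only to keep the sketch self-contained; a multi-node
    component would start processes instead.
    """

    def __init__(self, run_fn, num_workers: int):
        super().__init__(run_fn)
        self.num_workers = num_workers

    def __call__(self, *args, **kwargs):
        with ThreadPoolExecutor(max_workers=self.num_workers) as pool:
            futures = [
                pool.submit(self.run_fn, *args, worker=index, **kwargs)
                for index in range(self.num_workers)
            ]
            # Results are collected in submission order, not completion order.
            return [future.result() for future in futures]


def run(message: str, worker: int = 0) -> str:
    return f"{message} from worker {worker}"


direct = DirectExecutor(run)("hello")
fanned = FanOutExecutor(run, num_workers=3)("hello")
print(direct)
print(fanned)
```

Swapping the executor changes how many times and where run is invoked without touching run itself, which is the design choice the multi-node components appear to build on.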

Fixes #<issue_number>

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

cc @Borda

justusschock

lgtm, a few minor comments

src/lightning_app/utilities/app_helpers.py

src/lightning_app/components/multi_node/lite.py

src/lightning_app/components/multi_node/pl.py

src/lightning_app/components/multi_node/pytorch_spawn.py

src/lightning_app/components/multi_node/lite.py

src/lightning_app/components/multi_node/__init__.py

requirements/app/examples.txt

src/lightning_app/core/work.py

src/lightning_app/utilities/app_helpers.py

tests/tests_app_examples/conftest.py

examples/app_multi_node/README.md

Co-authored-by: Adrian Wälchli <[email protected]>

(cherry picked from commit f9a6573)

tchaton added 30 commits November 6, 2022 11:46

update

3e187d1

update

092b36a

update

fcb2ea2

Merge branch 'master' into add_multi_node_examples

6481338

update

e1271ce

Merge branch 'add_multi_node_examples' of https://github.com/Lightning-AI/lightning into add_multi_node_examples

478d0f0

update

0c5a079

update

804c5cb

update

baf1cae

update

a393f58

update

38f1c72

update

4ddb3ae

update

ed93320

update

402b6fd

update

dece823

update

4b7e8af

update

db336d3

update

2cd0d54

update

7da57cd

update

651590e

update

17ac6db

update

589ff92

update

d221d35

Merge branch 'master' into add_multi_node_examples

53597c7

update

fa6def5

Merge branch 'add_multi_node_examples' of https://github.com/Lightning-AI/lightning into add_multi_node_examples

0adcdf3

update

f2fa720

update

7778c58

update

c005373

update

f8fda2e

tchaton added 3 commits November 8, 2022 09:30

update

7975fa8

update

512bd73

Merge branch 'master' into expose_work_runner

9ac00b1

ethanwharris approved these changes Nov 8, 2022

mergify bot added the "ready" (PRs ready to be merged) label Nov 8, 2022

justusschock approved these changes Nov 8, 2022

Borda requested review from Borda and manskx and removed request for otaj, rohitgr7 and kaushikb11 November 8, 2022 10:05

awaelchli reviewed Nov 8, 2022

update

86e27c9

tchaton requested review from awaelchli and removed request for manskx November 8, 2022 10:47

tchaton added 3 commits November 8, 2022 10:49

update

6c03a31

update

ffea172

add note

e5a4880

awaelchli approved these changes Nov 8, 2022

examples/app_multi_node/README.md (outdated, resolved)

examples/app_multi_node/README.md (outdated, resolved)

tchaton and others added 5 commits November 8, 2022 12:11

Update examples/app_multi_node/README.md

8c59907

Co-authored-by: Adrian Wälchli <[email protected]>

Merge branch 'master' into expose_work_runner

74dfa0f

update

b157a21

Merge branch 'expose_work_runner' of https://github.com/Lightning-AI/lightning into expose_work_runner

e1bce24

Merge branch 'master' into expose_work_runner

41b6b81

tchaton enabled auto-merge (squash) November 8, 2022 12:15

tchaton merged commit f9a6573 into master Nov 8, 2022

tchaton deleted the expose_work_runner branch November 8, 2022 12:55

Borda pushed a commit that referenced this pull request Nov 8, 2022

[App] Expose Run Work Executor (#15561)

66e7b89

(cherry picked from commit f9a6573)

lexierule pushed a commit that referenced this pull request Nov 10, 2022

[App] Expose Run Work Executor (#15561)

893c4f9

(cherry picked from commit f9a6573)

nicolai86 mentioned this pull request Dec 7, 2022

Fix typo in package publish action #15948

Merged
