Hydra + DDP Fails for NeMo after Hydra refactor in 1.8 #15689

@SeanNaren

Bug description

Related #15545

NeMo PR where the regression was spotted: NVIDIA-NeMo/NeMo#5353

After #11617 was merged and included in 1.8, NeMo breaks with DDP (NeMo uses Hydra internally). I'm going to cross-paste the explanation from the above PR:

In #11617, Lightning changed the way sub-processes are started in DDP. Instead of re-running the command (and passing env variables to set the rank), sub-processes are now launched from the config YAML file that Hydra auto-generates, which is stored by default in the `.hydra` output subdirectory.
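
To make the mechanism concrete, each sub-process launch now looks roughly like this (a simplified sketch, not Lightning's actual code; hydra_ddp_subprocess_cmd is a hypothetical stand-in for the internal launcher helper, and the flag names mirror Hydra's CLI):

import os
import sys


def hydra_ddp_subprocess_cmd(local_rank, run_dir):
    # Point the child at the config Hydra saved for the main process,
    # instead of replaying the original command line.
    return [
        sys.executable,
        sys.argv[0],
        "--config-path", os.path.join(run_dir, ".hydra"),
        "--config-name", "config.yaml",
        # each rank gets its own output subdir so the ranks don't clash
        f"hydra.output_subdir=.pl_hydra_local_rank_{local_rank}",
        f"hydra.run.dir={run_dir}",
    ]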

In NeMo, we disable the creation of this output subdirectory and always set the run directory to the current working directory. This lets the experiment manager, rather than Hydra, handle everything regarding the checkpoint/logging directories.
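
The effect is roughly equivalent to injecting these standard Hydra overrides before Hydra parses the command line (a sketch of the effect only, not NeMo's actual implementation):

import sys

# hydra.output_subdir=null disables the auto-saved .hydra config dump;
# hydra.run.dir=. keeps the run in the current working directory.
sys.argv += ["hydra.output_subdir=null", "hydra.run.dir=."]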

The issue is that when the sub-processes are launched, the Hydra runner is not aware of the experiment manager's directory choices. If we allowed the subdirectory to be created in the default `.hydra` location, the DDP Lightning code would start processes in the current working directory, each creating a new folder (`.pl_hydra_local_rank_{rank}`).

The problem is that if you have multiple runs in the same repo, they will all write these folders to the same place and overwrite each other.
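
To make the clash concrete, two concurrent 2-device runs launched from the same repo root would fight over the same folders (illustrative layout only):

repo/
├── .hydra/                    # run A's saved config, overwritten by run B
├── .pl_hydra_local_rank_1/    # run A's rank-1 config, overwritten by run B
└── report.py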

I have been unable to come up with an elegant solution between NeMo and Lightning.

  • The easiest option would be to monkey-patch the launcher back to the old `sys.argv`-based behaviour, as sketched below; however, this isn't future-proof at all, if anything it goes in the opposite direction.
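
For illustration, the workaround would look something like this (a sketch only; _hydra_subprocess_cmd and its (command, cwd) return shape are assumptions about Lightning 1.8 internals, not a supported API):

import os
import sys

import pytorch_lightning.strategies.launchers.subprocess_script as subprocess_script


def _argv_subprocess_cmd(local_rank):
    # Pre-1.8 behaviour: re-run the original command verbatim; the rank
    # reaches the child via environment variables, not a saved config.
    return [sys.executable] + sys.argv, os.getcwd()


# Assumed internal hook; the real name/signature may differ.
subprocess_script._hydra_subprocess_cmd = _argv_subprocess_cmd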

I may have missed something, however, so if anyone has any other suggestions on how we can fix this, please let me know!

cc @tchaton @justusschock @awaelchli @akihironitta @Borda @titu1994 @ericharper

How to reproduce the bug

Requires 2 devices (not sure if it has to be on the GPU though).

# install NeMo with Lightning 1.8 (WIP) support
pip install https://github.com/NVIDIA/NeMo/archive/refs/heads/feat/lightning_1.8_support.zip

Create a config at conf/config.yaml:

name: "QuartzNet15x5"
exp_manager:
  exp_dir: null
  name: ${name}
  create_tensorboard_logger: False
  create_checkpoint_callback: False
  create_wandb_logger: False

Create a file report.py with this code:

import os

import torch
from nemo.core.config import hydra_runner
from nemo.utils.exp_manager import exp_manager
from pytorch_lightning import LightningModule, Trainer
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


# NeMo's wrapper around @hydra.main; it disables Hydra's output subdirectory
@hydra_runner(config_path="conf", config_name="config")
def run(cfg):
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        max_epochs=1,
        devices=2,
        accelerator='gpu',
        logger=False,
        enable_checkpointing=False,
        strategy='ddp',  # spawns a sub-process for the second device
    )
    # the experiment manager, not Hydra, owns the logging/checkpoint dirs
    exp_manager(trainer=trainer, cfg=cfg.exp_manager)
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)


if __name__ == "__main__":
    run()

Run the file:

python report.py
[NeMo W 2022-11-15 05:08:44 nemo_logging:349] /home/snarenthiran/anaconda3/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2022-11-15 05:08:44 exp_manager:343] Experiments will be logged at /home/snarenthiran/nemo_experiments/QuartzNet15x5/2022-11-15_05-08-44
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Cannot find primary config 'config.yaml'. Check that it's in your config search path.

Config search path:
	provider=hydra, path=pkg://hydra.conf
	provider=main, path=file:///home/snarenthiran
	provider=schema, path=structured://

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
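
Reading the traceback together with the explanation above: the rank-1 sub-process is pointed at the .hydra/config.yaml that Hydra would normally have auto-saved, but NeMo disabled its creation, so the child cannot resolve the primary config and DDP initialization fails.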
