
RFC: Remove num_nodes Trainer argument and infer world size from cluster environment directly #14078

@awaelchli

Description

🚀 Feature

Remove the redundant num_nodes Trainer argument. Knowing the number of nodes is not required, and the world size is provided by the cluster environment anyway.

Motivation

Users have always struggled to get multi-node training working because there are many points of failure in the configuration. One of them is setting the number of devices and nodes: in Lightning, you are required to set devices=n and num_nodes=m to match the world_size=n*m set by the cluster or launcher (torchrun, SLURM, etc.). When num_nodes is not set correctly, too few processes join the group and the program hangs.
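For illustration, this is roughly what has to be kept in sync by hand today (a minimal sketch; the torchrun invocation and the concrete numbers are just an example):

```python
# Sketch of the current situation: devices * num_nodes must equal the world size
# that the launcher actually creates, otherwise process group init gets stuck.
from pytorch_lightning import Trainer

# Launched e.g. with `torchrun --nnodes=2 --nproc_per_node=4 train.py`,
# which results in a world size of 8.
trainer = Trainer(
    accelerator="gpu",
    devices=4,      # processes (GPUs) per node
    num_nodes=2,    # must match the launcher's node count, or the job hangs
    strategy="ddp",
)
```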

We periodically receive issues and questions where it becomes clear that users forget to set these values correctly (#13804, #14022, #10098, #8707, #7429, #6206, to name a few). But technically, Lightning never needs to know the number of nodes, only the world size, and this value can be read from the ClusterEnvironment directly.
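As a minimal sketch of reading it directly, assuming a job launched via torchrun and the ClusterEnvironment classes under pytorch_lightning.plugins.environments (the import path may differ between versions):

```python
# Sketch: the cluster environment already reports the world size set by the
# launcher, so the Trainer does not need devices * num_nodes to derive it.
from pytorch_lightning.plugins.environments import TorchElasticEnvironment

env = TorchElasticEnvironment()   # e.g. when the job is launched via torchrun
world_size = env.world_size()     # read from the launcher-provided environment
global_rank = env.global_rank()   # likewise provided by the launcher
print(f"world_size={world_size}, global_rank={global_rank}")
```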

The conclusion: num_nodes in the Trainer is redundant.

Another benefit of not having to specify the number of nodes is that the cluster can dynamically allocate nodes and devices based on available resources. In the end, a user who wants to train on 8 GPUs does not care whether they come as 2 nodes with 4 GPUs each or 4 nodes with 2 GPUs each.

Historically, num_nodes was introduced before the ClusterEnvironment abstraction became the primary way of collecting information about the cluster.

Pitch

Deprecate the redundant num_nodes Trainer argument.

Pros:

  • One less thing for users to get wrong
  • One less step to explain in the docs
  • Code simplifies
  • It is no longer required to know the number of nodes ahead of time

Cons:

  • None currently known

Alternatives

Keep it, but validate the num_nodes setting against the cluster environment (#10107); a sketch of that check follows.
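A rough sketch of what that validation could look like (the helper name `_validate_num_nodes` is hypothetical, not an existing Lightning API):

```python
# Hypothetical check (sketch): fail fast when the user-provided num_nodes and
# devices do not multiply up to the world size reported by the cluster.
def _validate_num_nodes(num_nodes: int, num_devices: int, cluster_env) -> None:
    expected = num_nodes * num_devices
    actual = cluster_env.world_size()
    if expected != actual:
        raise ValueError(
            f"`num_nodes={num_nodes}` and `devices={num_devices}` imply a world size "
            f"of {expected}, but the cluster environment reports {actual}."
        )
```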

Additional context

#13506, #7361, #13605



cc @tchaton @rohitgr7 @justusschock @kaushikb11 @awaelchli @akihironitta @ananthsub @Borda
