
RFC: Remove num_nodes Trainer argument and infer world size from cluster environment directly #14078

@awaelchli

Description

🚀 Feature

Remove the redundant num_nodes Trainer argument. Knowing the number of nodes is not required, and the world size is provided by the cluster environment anyway.

Motivation

Users have always struggled to get multi-node training working because there are many points of failure in the configuration. One of them is setting the number of devices and nodes: in Lightning, you are required to set devices=n and num_nodes=m to match the world_size=n*m set by the cluster or launcher (torchrun, SLURM, etc.). When num_nodes is not set correctly, too few processes join the group and the program hangs.
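For illustration, this is roughly what has to be kept in sync by hand today (a minimal sketch; the torchrun invocation and the concrete numbers are just an example):

```python
# Sketch of the current situation: devices * num_nodes must equal the world size
# that the launcher actually creates, otherwise process group init gets stuck.
from pytorch_lightning import Trainer

# Launched e.g. with `torchrun --nnodes=2 --nproc_per_node=4 train.py`,
# which results in a world size of 8.
trainer = Trainer(
    accelerator="gpu",
    devices=4,      # processes (GPUs) per node
    num_nodes=2,    # must match the launcher's node count, or the job hangs
    strategy="ddp",
)
```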

We periodically receive issues and questions where it becomes clear that users forget to set these values correctly (#13804, #14022, #10098, #8707, #7429, #6206, to name a few). But technically, Lightning never needs to know the number of nodes, only the world size, and this value can be read from the ClusterEnvironment directly.
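As a minimal sketch of reading it directly, assuming a job launched via torchrun and the ClusterEnvironment classes under pytorch_lightning.plugins.environments (the import path may differ between versions):

```python
# Sketch: the cluster environment already reports the world size set by the
# launcher, so the Trainer does not need devices * num_nodes to derive it.
from pytorch_lightning.plugins.environments import TorchElasticEnvironment

env = TorchElasticEnvironment()   # e.g. when the job is launched via torchrun
world_size = env.world_size()     # read from the launcher-provided environment
global_rank = env.global_rank()   # likewise provided by the launcher
print(f"world_size={world_size}, global_rank={global_rank}")
```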

The conclusion: num_nodes in the Trainer is redundant.

Another benefit of not having to specify the number of nodes is that the cluster can dynamically allocate nodes and devices based on available resources. In the end, a user who wants to train on 8 GPUs does not care whether they come as 2 nodes with 4 GPUs each or 4 nodes with 2 GPUs each.

Historically, num_nodes was introduced before the ClusterEnvironment abstraction became the primary way of collecting information about the cluster.

Pitch

Deprecate the redundant num_nodes Trainer argument.

Pros:

  • One less thing for users to get wrong
  • One less step to explain in the docs
  • Code simplifies
  • It is no longer required to know the number of nodes ahead of time

Cons:

  • None currently known

Alternatives

Keep it, but validate the num_nodes setting against the cluster environment (#10107); a sketch of that check follows.
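A rough sketch of what that validation could look like (the helper name `_validate_num_nodes` is hypothetical, not an existing Lightning API):

```python
# Hypothetical check (sketch): fail fast when the user-provided num_nodes and
# devices do not multiply up to the world size reported by the cluster.
def _validate_num_nodes(num_nodes: int, num_devices: int, cluster_env) -> None:
    expected = num_nodes * num_devices
    actual = cluster_env.world_size()
    if expected != actual:
        raise ValueError(
            f"`num_nodes={num_nodes}` and `devices={num_devices}` imply a world size "
            f"of {expected}, but the cluster environment reports {actual}."
        )
```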

Additional context

#13506, #7361, #13605



cc @tchaton @rohitgr7 @justusschock @kaushikb11 @awaelchli @akihironitta @ananthsub @Borda
