Allow obtaining num_nodes from ClusterEnvironment #7361

@leezu

🚀 Feature

num_nodes must currently be specified manually by the user. However, the number of nodes is generally known in a cluster environment [1], so ClusterEnvironment could provide it and num_nodes could be initialized from that value.

[1] For example $AWS_BATCH_JOB_NUM_NODES in https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html

Pitch

https://github.com/PyTorchLightning/pytorch-lightning/blob/763a9a9495977b23cbd6a57f10253b662fd592a5/pytorch_lightning/plugins/training_type/ddp.py#L62-L75

could be updated to initialize num_nodes from the ClusterEnvironment whenever one is provided and it implements a num_nodes method.
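A minimal sketch of the idea, not Lightning's actual API: the class AWSBatchEnvironment and the helper resolve_num_nodes are hypothetical names introduced only for illustration, and the fallback to a single node is an assumption. The only detail taken from the references above is the $AWS_BATCH_JOB_NUM_NODES variable.

```python
import os
from typing import Optional


class AWSBatchEnvironment:
    """Hypothetical stand-in for a ClusterEnvironment subclass (illustration only)."""

    def num_nodes(self) -> Optional[int]:
        # AWS Batch multi-node parallel jobs export this variable [1].
        value = os.environ.get("AWS_BATCH_JOB_NUM_NODES")
        return int(value) if value is not None else None


def resolve_num_nodes(
    num_nodes: Optional[int] = None,
    cluster_environment: Optional[object] = None,
) -> int:
    """Hypothetical helper: pick num_nodes for the training plugin."""
    # An explicit user value always wins.
    if num_nodes is not None:
        return num_nodes
    # Otherwise ask the environment, but only if it implements num_nodes().
    getter = getattr(cluster_environment, "num_nodes", None)
    if callable(getter):
        detected = getter()
        if detected is not None:
            return detected
    # Assumed default when nothing is known: a single node.
    return 1
```

With this shape, a user-supplied value still takes precedence, and environments that do not implement num_nodes keep working unchanged.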

cc @Borda @awaelchli @ananthsub
