[CLI] add support for cluster management #13835
Merged. tchaton merged 51 commits into Lightning-AI:master from nicolai86:nicolai86/cluster-management on Aug 2, 2022
Changes from 49 commits
ac3495c e2e test skeleton for cluster CLI (nicolai86)
e5b85a3 implement skeleton of cluster CLI (nicolai86)
aff7ec2 working cluster listing (nicolai86)
ce34ff0 migrate cluster creation (nicolai86)
8a72873 add cluster deletion code (nicolai86)
8d4d88c migrate some cluster configuration into cmd_clusters (nicolai86)
bd1eccb refactor waiting for cluster state to be configurable (nicolai86)
5941876 add basic unittests for wait for cluster state (nicolai86)
483b993 add basic unit tests for cluster name (nicolai86)
81b0706 add unit tests for cluster mgmt api (nicolai86)
362c470 documentation update (nicolai86)
cabf617 adjust wording (nicolai86)
6378481 Merge branch 'master' into nicolai86/cluster-management (nicolai86)
b23164d change CLI to <lightning> <verb> <object> (nicolai86)
ac8f0b5 more wording changes (nicolai86)
fc5a1e6 adjust e2e tests (nicolai86)
9330c85 refactor command names to be more uniform (nicolai86)
d8074ec update CHANGELOG (nicolai86)
99355c3 fix tests not working when un-authenticated (nicolai86)
e02e143 Update src/lightning_app/CHANGELOG.md (nicolai86)
d1ad3c2 wording update (nicolai86)
459bbfa Merge branch 'master' into nicolai86/cluster-management (nicolai86)
ef6748b drop dependency on arrow (nicolai86)
911411b doc strings and variable renaming (nicolai86)
1ad8438 add api unit tests (nicolai86)
397131d dropping environment variables not relevant to CLI (nicolai86)
9ce60e1 split up CLI (nicolai86)
a78e6b6 do not print raw cluster response (nicolai86)
d60d28a Merge branch 'master' into nicolai86/cluster-management (nicolai86)
7766a7a Merge branch 'master' into nicolai86/cluster-management (nicolai86)
1597cfd properly import commands from separate files (nicolai86)
3da681d Merge branch 'master' into nicolai86/cluster-management (nicolai86)
e9b3900 PR feedback from jirka, thomas (nicolai86)
9aa1b26 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
3e6e1e5 Update src/lightning_app/cli/cmd_clusters.py (nicolai86)
93ee4aa Update src/lightning_app/cli/cmd_clusters.py (nicolai86)
54a3969 Update src/lightning_app/cli/cmd_clusters.py (nicolai86)
f1d9d4a Update src/lightning_app/cli/cmd_clusters.py (nicolai86)
c85e301 Update src/lightning_app/cli/cmd_clusters.py (nicolai86)
a0489d6 Update src/lightning_app/cli/cmd_clusters.py (nicolai86)
164a63a Update src/lightning_app/cli/cmd_clusters.py (nicolai86)
58046c5 Update src/lightning_app/cli/cmd_clusters.py (nicolai86)
98346b4 Update src/lightning_app/cli/cmd_clusters.py (nicolai86)
4b49287 Update src/lightning_app/utilities/openapi.py (nicolai86)
f50670d Update src/lightning_app/utilities/openapi.py (nicolai86)
22868f3 Update src/lightning_app/utilities/openapi.py (nicolai86)
1b8c19a Update src/lightning_app/utilities/openapi.py (nicolai86)
37fd580 more python comment style changes (nicolai86)
1832d9d [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
a6d4323 remove instance types list (nicolai86)
b89ff6e Merge branch 'master' into nicolai86/cluster-management (awaelchli)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,240 @@ | ||
import json | ||
import re | ||
import time | ||
from datetime import datetime | ||
|
||
import click | ||
from lightning_cloud.openapi import ( | ||
V1AWSClusterDriverSpec, | ||
V1ClusterDriver, | ||
V1ClusterPerformanceProfile, | ||
V1ClusterSpec, | ||
V1CreateClusterRequest, | ||
V1InstanceSpec, | ||
V1KubernetesClusterDriver, | ||
) | ||
from lightning_cloud.openapi.models import Externalv1Cluster, V1ClusterState, V1ClusterType | ||
from rich.console import Console | ||
from rich.table import Table | ||
from rich.text import Text | ||
|
||
from lightning_app.cli.core import Formatable | ||
from lightning_app.utilities.network import LightningClient | ||
from lightning_app.utilities.openapi import create_openapi_object, string2dict | ||
|
||
CLUSTER_STATE_CHECKING_TIMEOUT = 60 | ||
MAX_CLUSTER_WAIT_TIME = 5400 | ||
|
||
|
||
class AWSClusterManager: | ||
"""AWSClusterManager implements API calls specific to Lightning AI BYOC compute clusters when the AWS provider | ||
is selected as the backend compute.""" | ||
|
||
def __init__(self): | ||
self.api_client = LightningClient() | ||
|
||
def create( | ||
self, | ||
cost_savings: bool = False, | ||
cluster_name: str = None, | ||
role_arn: str = None, | ||
region: str = "us-east-1", | ||
external_id: str = None, | ||
instance_types: [str] = [], | ||
edit_before_creation: bool = False, | ||
wait: bool = False, | ||
): | ||
"""Request Lightning AI BYOC compute cluster creation. | ||
|
||
Args: | ||
cost_savings: Specifies if the cluster uses cost savings mode | ||
cluster_name: The name of the cluster to be created | ||
role_arn: AWS IAM Role ARN used to provision resources | ||
region: AWS region containing compute resources | ||
external_id: AWS IAM Role external ID | ||
instance_types: AWS instance types supported by the cluster | ||
edit_before_creation: Enables interactive editing of requests before submitting it to Lightning AI. | ||
wait: Waits for the cluster to be in a RUNNING state. Only use this for debugging. | ||
""" | ||
performance_profile = V1ClusterPerformanceProfile.DEFAULT | ||
|
||
if cost_savings: | ||
|
||
# In cost saving mode the number of compute nodes is reduced to one, reducing the cost for clusters | ||
# with low utilization. | ||
performance_profile = V1ClusterPerformanceProfile.COST_SAVING | ||
|
||
body = V1CreateClusterRequest( | ||
name=cluster_name, | ||
spec=V1ClusterSpec( | ||
cluster_type=V1ClusterType.BYOC, | ||
performance_profile=performance_profile, | ||
driver=V1ClusterDriver( | ||
kubernetes=V1KubernetesClusterDriver( | ||
aws=V1AWSClusterDriverSpec( | ||
region=region, | ||
role_arn=role_arn, | ||
external_id=external_id, | ||
instance_types=[V1InstanceSpec(name=x) for x in instance_types], | ||
) | ||
) | ||
), | ||
), | ||
) | ||
new_body = body | ||
if edit_before_creation: | ||
after = click.edit(json.dumps(body.to_dict(), indent=4)) | ||
if after is not None: | ||
new_body = create_openapi_object(string2dict(after), body) | ||
if new_body == body: | ||
click.echo("cluster unchanged") | ||
|
||
resp = self.api_client.cluster_service_create_cluster(body=new_body) | ||
if wait: | ||
_wait_for_cluster_state(self.api_client, resp.id, V1ClusterState.RUNNING) | ||
|
||
click.echo(f"{resp.id} cluster is {resp.status.phase}") | ||
|
||
def list(self): | ||
|
||
resp = self.api_client.cluster_service_list_clusters(phase_not_in=[V1ClusterState.DELETED]) | ||
console = Console() | ||
console.print(ClusterList(resp.clusters).as_table()) | ||
|
||
def delete(self, cluster_id: str = None, force: bool = False, wait: bool = False): | ||
if force: | ||
click.echo( | ||
""" | ||
Deletes a BYOC cluster. Lightning AI removes cluster artifacts and any resources running on the cluster.\n | ||
WARNING: Deleting a cluster does not clean up any resources managed by Lightning AI.\n | ||
Check your cloud provider to verify that existing cloud resources are deleted. | ||
""" | ||
) | ||
click.confirm("Do you want to continue?", abort=True) | ||
|
||
self.api_client.cluster_service_delete_cluster(id=cluster_id, force=force) | ||
click.echo("Cluster deletion triggered successfully") | ||
|
||
if wait: | ||
_wait_for_cluster_state(self.api_client, cluster_id, V1ClusterState.DELETED) | ||
|
||
|
||
class ClusterList(Formatable): | ||
def __init__(self, clusters: [Externalv1Cluster]): | ||
self.clusters = clusters | ||
|
||
def as_json(self) -> str: | ||
return json.dumps([cluster.to_dict() for cluster in self.clusters], default=str) | ||
|
||
def as_table(self) -> Table: | ||
table = Table("id", "name", "type", "status", "created", show_header=True, header_style="bold green") | ||
phases = { | ||
V1ClusterState.QUEUED: Text("queued", style="bold yellow"), | ||
V1ClusterState.PENDING: Text("pending", style="bold yellow"), | ||
V1ClusterState.RUNNING: Text("running", style="bold green"), | ||
V1ClusterState.FAILED: Text("failed", style="bold red"), | ||
V1ClusterState.DELETED: Text("deleted", style="bold red"), | ||
} | ||
|
||
cluster_type_lookup = { | ||
V1ClusterType.BYOC: Text("byoc", style="bold yellow"), | ||
V1ClusterType.GLOBAL: Text("lightning-cloud", style="bold green"), | ||
} | ||
for cluster in self.clusters: | ||
cluster: Externalv1Cluster | ||
status = phases[cluster.status.phase] | ||
if cluster.spec.desired_state == V1ClusterState.DELETED and cluster.status.phase != V1ClusterState.DELETED: | ||
status = Text("terminating", style="bold red") | ||
|
||
# this guard is necessary only until 0.3.93 releases which includes the `created_at` | ||
# field to the external API | ||
created_at = datetime.now() | ||
if hasattr(cluster, "created_at"): | ||
created_at = cluster.created_at | ||
|
||
table.add_row( | ||
cluster.id, | ||
cluster.name, | ||
cluster_type_lookup.get(cluster.spec.cluster_type, Text("unknown", style="red")), | ||
status, | ||
created_at.strftime("%Y-%m-%d") if created_at else "", | ||
) | ||
return table | ||
|
||
|
||
def _wait_for_cluster_state( | ||
api_client: LightningClient, | ||
cluster_id: str, | ||
target_state: V1ClusterState, | ||
max_wait_time: int = MAX_CLUSTER_WAIT_TIME, | ||
check_timeout: int = CLUSTER_STATE_CHECKING_TIMEOUT, | ||
): | ||
|
||
"""_wait_for_cluster_state waits until the provided cluster has reached a desired state, or failed. | ||
|
||
Args: | ||
api_client: LightningClient used for polling | ||
cluster_id: Specifies the cluster to wait for | ||
target_state: Specifies the desired state the target cluster needs to meet | ||
max_wait_time: Maximum duration to wait (in seconds) | ||
check_timeout: duration between polling for the cluster state (in seconds) | ||
""" | ||
start = time.time() | ||
elapsed = 0 | ||
while elapsed < max_wait_time: | ||
cluster_resp = api_client.cluster_service_list_clusters() | ||
new_cluster = None | ||
for clust in cluster_resp.clusters: | ||
if clust.id == cluster_id: | ||
new_cluster = clust | ||
break | ||
if new_cluster is not None: | ||
if new_cluster.status.phase == target_state: | ||
break | ||
elif new_cluster.status.phase == V1ClusterState.FAILED: | ||
raise click.ClickException(f"Cluster {cluster_id} is in failed state.") | ||
time.sleep(check_timeout) | ||
elapsed = time.time() - start | ||
else: | ||
raise click.ClickException("Max wait time elapsed") | ||
|
||
|
||
def _check_cluster_name_is_valid(_ctx, _param, value): | ||
pattern = r"^(?!-)[a-z0-9-]{1,63}(?<!-)$" | ||
if not re.match(pattern, value): | ||
raise click.ClickException( | ||
"""The cluster name is invalid. | ||
Cluster names can only contain lowercase letters, numbers, and hyphens ( - ), cannot start or end with a hyphen, and are limited to 63 characters. | ||
Provide a cluster name using valid characters and try again.""" | ||
) | ||
return value | ||
|
||
|
||
_default_instance_types = frozenset( | ||
|
||
[ | ||
"g2.8xlarge", | ||
"g3.16xlarge", | ||
"g3.4xlarge", | ||
"g3.8xlarge", | ||
"g3s.xlarge", | ||
"g4dn.12xlarge", | ||
"g4dn.16xlarge", | ||
"g4dn.2xlarge", | ||
"g4dn.4xlarge", | ||
"g4dn.8xlarge", | ||
"g4dn.metal", | ||
"g4dn.xlarge", | ||
"p2.16xlarge", | ||
"p2.8xlarge", | ||
"p2.xlarge", | ||
"p3.16xlarge", | ||
"p3.2xlarge", | ||
"p3.8xlarge", | ||
"p3dn.24xlarge", | ||
# "p4d.24xlarge", # currently not supported | ||
"t2.large", | ||
"t2.medium", | ||
"t2.xlarge", | ||
"t2.2xlarge", | ||
"t3.large", | ||
"t3.medium", | ||
"t3.xlarge", | ||
"t3.2xlarge", | ||
] | ||
) |
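Reviewer note on `_wait_for_cluster_state`: the polling loop above can be exercised without a live API if the clock and sleep functions are injected. The following is a minimal sketch under that assumption, not the code merged in this PR. `wait_for_state`, `WaitError`, and the `fetch_phase` callback are hypothetical names introduced for illustration; the defaults mirror `MAX_CLUSTER_WAIT_TIME` and `CLUSTER_STATE_CHECKING_TIMEOUT`.

```python
import time
from typing import Callable, Optional


class WaitError(Exception):
    """Raised when the resource fails or the wait budget runs out (hypothetical)."""


def wait_for_state(
    fetch_phase: Callable[[], Optional[str]],
    target: str,
    failed: str = "failed",
    max_wait_time: float = 5400,
    check_timeout: float = 60,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll fetch_phase until it returns target; return the elapsed seconds."""
    start = clock()
    elapsed = 0.0
    while elapsed < max_wait_time:
        phase = fetch_phase()  # None means the resource is not visible yet
        if phase == target:
            return elapsed
        if phase == failed:
            raise WaitError("resource entered the failed state")
        sleep(check_timeout)
        elapsed = clock() - start
    raise WaitError("max wait time elapsed")


# Usage with a fake clock, so no real time passes.
now = [0.0]
phases = iter([None, "pending", "running"])
waited = wait_for_state(
    lambda: next(phases),
    "running",
    clock=lambda: now[0],
    sleep=lambda seconds: now.__setitem__(0, now[0] + seconds),
)
print(waited)
```

Injecting `clock` and `sleep` keeps the control flow identical to the original loop while letting a unit test cover the success, failure, and timeout paths in milliseconds.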
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
import abc | ||
|
||
from rich.table import Table | ||
|
||
|
||
class Formatable(abc.ABC): | ||
@abc.abstractmethod | ||
def as_table(self) -> Table: | ||
pass | ||
|
||
@abc.abstractmethod | ||
def as_json(self) -> str: | ||
pass |
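Reviewer note on `Formatable`: a concrete subclass has to implement both `as_table` and `as_json`, and `as_json` needs plain data because `json.dumps` cannot serialize arbitrary objects. The sketch below illustrates that contract using only the standard library. It is an assumption-laden stand-in: `PlainFormatable`, `Cluster`, and `PlainClusterList` are hypothetical names, and `as_table` returns a plain string here rather than a `rich.table.Table`.

```python
import abc
import json
from dataclasses import asdict, dataclass
from typing import List


class PlainFormatable(abc.ABC):
    """Stand-in for the PR's Formatable; as_table returns plain text here."""

    @abc.abstractmethod
    def as_table(self) -> str:
        ...

    @abc.abstractmethod
    def as_json(self) -> str:
        ...


@dataclass
class Cluster:
    """Hypothetical record, not the lightning_cloud model."""

    id: str
    name: str
    phase: str


class PlainClusterList(PlainFormatable):
    def __init__(self, clusters: List[Cluster]):
        self.clusters = clusters

    def as_json(self) -> str:
        # Convert each record to a plain dict before serializing.
        return json.dumps([asdict(c) for c in self.clusters])

    def as_table(self) -> str:
        header = f"{'id':<8}{'name':<12}{'status':<10}"
        rows = [f"{c.id:<8}{c.name:<12}{c.phase:<10}" for c in self.clusters]
        return "\n".join([header, *rows])


listing = PlainClusterList([Cluster("c1", "alpha", "running")])
print(listing.as_table())
print(listing.as_json())
```

Because both methods are abstract, instantiating the base class directly raises `TypeError`, which is what forces every output type to support both renderings.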
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,86 @@ | ||
import click | ||
|
||
from lightning_app.cli.cmd_clusters import _check_cluster_name_is_valid, _default_instance_types, AWSClusterManager | ||
|
||
|
||
@click.group("create") | ||
def create(): | ||
"""Create Lightning AI BYOC managed resources.""" | ||
pass | ||
|
||
|
||
@create.command("cluster") | ||
@click.argument("cluster_name", callback=_check_cluster_name_is_valid) | ||
@click.option("--provider", "provider", type=str, default="aws", help="cloud provider to be used for your cluster") | ||
@click.option("--external-id", "external_id", type=str, required=True) | ||
@click.option( | ||
"--role-arn", "role_arn", type=str, required=True, help="AWS role ARN attached to the associated resources." | ||
) | ||
@click.option( | ||
"--region", | ||
"region", | ||
type=str, | ||
required=False, | ||
default="us-east-1", | ||
help="AWS region that is used to host the associated resources.", | ||
) | ||
@click.option( | ||
"--instance-types", | ||
"instance_types", | ||
type=str, | ||
required=False, | ||
default=",".join(_default_instance_types), | ||
help="Instance types that you want to support, for compute jobs within the cluster.", | ||
) | ||
@click.option( | ||
"--cost-savings", | ||
"cost_savings", | ||
type=bool, | ||
required=False, | ||
default=False, | ||
is_flag=True, | ||
help="""Use this flag to ensure that the cluster is created with a profile that is optimized for cost savings. | ||
This makes runs cheaper but start-up times may increase.""", | ||
) | ||
@click.option( | ||
"--edit-before-creation", | ||
default=False, | ||
is_flag=True, | ||
help="Edit the cluster specs before submitting them to the API server.", | ||
) | ||
@click.option( | ||
"--wait", | ||
"wait", | ||
type=bool, | ||
required=False, | ||
default=False, | ||
is_flag=True, | ||
help="Enabling this flag makes the CLI wait until the cluster is running.", | ||
) | ||
def create_cluster( | ||
|
||
cluster_name: str, | ||
region: str, | ||
role_arn: str, | ||
external_id: str, | ||
provider: str, | ||
instance_types: str, | ||
edit_before_creation: bool, | ||
cost_savings: bool, | ||
wait: bool, | ||
**kwargs, | ||
): | ||
"""Create a Lightning AI BYOC compute cluster with your cloud provider credentials.""" | ||
if provider != "aws": | ||
click.echo("Only AWS is supported for now. But support for more providers is coming soon.") | ||
return | ||
cluster_manager = AWSClusterManager() | ||
cluster_manager.create( | ||
cluster_name=cluster_name, | ||
region=region, | ||
role_arn=role_arn, | ||
external_id=external_id, | ||
instance_types=instance_types.split(","), | ||
edit_before_creation=edit_before_creation, | ||
cost_savings=cost_savings, | ||
wait=wait, | ||
) |
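Reviewer note on the `cluster_name` callback: the pattern used by `_check_cluster_name_is_valid` accepts 1 to 63 characters drawn from lowercase letters, digits, and hyphens, and rejects a leading or trailing hyphen through the two lookarounds. A small standalone check, reusing the same pattern, shows which names pass. `is_valid_cluster_name` is a hypothetical helper for illustration and is not part of this PR.

```python
import re

# Same pattern as in _check_cluster_name_is_valid.
CLUSTER_NAME_PATTERN = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def is_valid_cluster_name(name: str) -> bool:
    """Return True when name matches the cluster name pattern."""
    return CLUSTER_NAME_PATTERN.match(name) is not None


for candidate in ["my-cluster-1", "-lead", "trail-", "Upper", "a.b", "a" * 64]:
    print(repr(candidate[:16]), is_valid_cluster_name(candidate))
```

One detail worth knowing: in Python, `$` also matches just before a trailing newline, so `"abc\n"` passes this check. Using `re.fullmatch` instead of `re.match` would reject it, if that ever matters for input that does not come from a shell argument.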