GPUs: take device offline when unhealthy (build logic in go-nvlib) #360

@jgehrcke

@klueska originally wrote:

We could start with the same health-check logic that we have in the k8s-device-plugin. This needs a major overhaul, but it is better than having nothing.

Once 26 is complete, we can then backport what was done here to the k8s-dra-driver repo as well.

Or better yet, we should build the logic for handling health checks in an abstract way in go-nvlib so that both projects can take advantage of it.

internal ref: cnt/issues/90
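A minimal sketch of what such a shared abstraction could look like, for discussion only. Every name here (`HealthEvent`, `HealthChecker`, `DeviceSet`, `Consume`, `fakeChecker`) is hypothetical and is not an existing go-nvlib API; the real health signal source (for example, whatever the k8s-device-plugin checks today) is replaced by a fake checker that replays canned events.

```go
package main

import (
	"sort"
	"sync"
)

// HealthEvent is a hypothetical notification that a device became unhealthy.
type HealthEvent struct {
	DeviceUUID string
	Reason     string
}

// HealthChecker is a hypothetical interface that a shared library could
// expose. Run sends events until it has nothing more to report or until
// stop is closed.
type HealthChecker interface {
	Run(stop <-chan struct{}, events chan<- HealthEvent) error
}

// DeviceSet tracks which devices a consumer currently offers for allocation.
type DeviceSet struct {
	mu      sync.Mutex
	healthy map[string]bool
}

// NewDeviceSet returns a set in which all given devices start out healthy.
func NewDeviceSet(uuids ...string) *DeviceSet {
	s := &DeviceSet{healthy: make(map[string]bool)}
	for _, u := range uuids {
		s.healthy[u] = true
	}
	return s
}

// MarkUnhealthy takes a device offline. It reports whether the device was
// known and healthy before the call.
func (s *DeviceSet) MarkUnhealthy(uuid string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.healthy[uuid] {
		return false
	}
	s.healthy[uuid] = false
	return true
}

// Healthy returns the UUIDs of all devices still online, sorted.
func (s *DeviceSet) Healthy() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	var out []string
	for u, ok := range s.healthy {
		if ok {
			out = append(out, u)
		}
	}
	sort.Strings(out)
	return out
}

// fakeChecker replays canned events; it stands in for a real checker.
type fakeChecker struct{ events []HealthEvent }

func (f fakeChecker) Run(stop <-chan struct{}, events chan<- HealthEvent) error {
	for _, ev := range f.events {
		select {
		case events <- ev:
		case <-stop:
			return nil
		}
	}
	return nil
}

// Consume runs the checker to completion and takes every reported device
// offline. It returns the UUIDs that were newly taken offline.
func Consume(c HealthChecker, set *DeviceSet) ([]string, error) {
	stop := make(chan struct{})
	defer close(stop)
	events := make(chan HealthEvent)
	errc := make(chan error, 1)
	go func() {
		errc <- c.Run(stop, events)
		close(events)
	}()
	var offlined []string
	for ev := range events {
		if set.MarkUnhealthy(ev.DeviceUUID) {
			offlined = append(offlined, ev.DeviceUUID)
		}
	}
	return offlined, <-errc
}
```

In this sketch, each project would supply its own reaction to `MarkUnhealthy` (for instance, no longer advertising the device), while the checker implementation would be shared.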

Metadata

Labels

feature: issue/PR that proposes a new feature or functionality
robustness: issue/pr: edge cases & fault tolerance

Projects

Status

In Progress
