KEP-5283: DRA: ResourceSlice Status for Device Health Tracking #5469
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: nojnhuh. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
This is far from complete, but I'd like some feedback on the Summary and Motivation sections to make sure the problem is scoped appropriately. I'd also like some help figuring out if the high-level ideas in the "Design Details" section are worth pursuing further or if one of the Alternatives seems like a better place to start.
#### Enabling Automated Remediation

As a cluster administrator, I want to determine how to remediate unhealthy devices when different failure modes require different methods of remediation.
I’m wondering how DeviceUnhealthy would be consumed for this user story.
Don’t we need some kind of unhealthy reason or device conditions for this?
Yes, definitely. I was hoping to get some consensus on which high-level approach to start with before defining the entire API though.
Before one can determine how to remediate an issue, one needs to know what the issues are.
I think both of those are rather device-specific, but it may be possible to come up with some broad categories, and how those might be remediated (a rough sketch of such a mapping follows this list):
- Device running too hot / fan not working
  - Taint device to reduce its load
  - Notify admin to check cooling
- Device memory (ECC) errors
  - If recurring non-recoverable ones, taint device and notify admin to check/replace memory
- Device power delivery issues
  - Taint device to reduce its load
  - Notify admin to check PSU
- Device power usage throttling
  - If frequent, taint device to reduce its load and notify admin to check device FW power limits
- Overuse of shared device or workload OOMs
  - Taint device to reduce its load
  - If recurring frequently, notify admin to check workload resource requests
- Device link quality / stability issues
  - Prefer devices with better link quality => resource request should specify required link BW
  - If severe enough, ban multi-device workloads and notify admin to investigate
- Specific workload hangs / increased user-space driver error counters
  - Stop scheduling that workload (it may use a buggy user-space driver or use it incorrectly)
  - Alert admin / dev to investigate that workload
- Old / buggy device FW
  - If some workloads work correctly with that FW and others do not, use taints
  - Schedule FW upgrade, and taint device during upgrade
- Device hangs / increased kernel driver / FW / HW error counters
  - Reset specific device part (e.g. compute)
  - Drain device and reset it
  - With too many device resets / error increases, taint device and alert admin
  - Drain all devices on same bus and reset bus
  - Drain whole node and reset it
  - Schedule device firmware update
  - Schedule device replacement
  - (First ones can be done by the (kernel) driver automatically, last ones require admin)
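To make the taxonomy above concrete, here is a minimal, hypothetical sketch of how a remediation controller might map broad failure categories to actions; the category names and actions are invented for illustration and are not part of the KEP:

```go
package main

import "fmt"

// FailureCategory is a hypothetical coarse-grained classification of device
// failures, loosely following the categories listed above.
type FailureCategory string

const (
	Thermal       FailureCategory = "Thermal"       // running too hot / fan failure
	MemoryErrors  FailureCategory = "MemoryErrors"  // recurring ECC errors
	PowerDelivery FailureCategory = "PowerDelivery" // PSU / power delivery issues
	LinkQuality   FailureCategory = "LinkQuality"   // link stability problems
	DeviceHang    FailureCategory = "DeviceHang"    // kernel driver / FW / HW errors
)

// remediate returns the actions a controller might take for a category.
// Real remediations would be driver- and vendor-specific.
func remediate(c FailureCategory) []string {
	switch c {
	case Thermal, PowerDelivery:
		return []string{"taint device to reduce load", "notify admin"}
	case MemoryErrors:
		return []string{"taint device", "notify admin to check/replace memory"}
	case LinkQuality:
		return []string{"prefer devices with better link quality"}
	case DeviceHang:
		return []string{"drain device", "reset device", "alert admin if resets recur"}
	default:
		return []string{"notify admin to investigate"}
	}
}

func main() {
	for _, c := range []FailureCategory{Thermal, DeviceHang} {
		fmt.Println(c, "->", remediate(c))
	}
}
```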
```go
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
	// State is the overall health status of a device.
	//
	// +required
	State DeviceHealthState `json:"state"`
}
```
How about introducing a bit more info? I think we can borrow several fields from PodCondition. For example:

Suggested change:

```go
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
	// State is the overall health status of a device.
	//
	// +required
	State DeviceHealthState `json:"state"`

	// Reason is the reason for this device health. It could be helpful especially when the state is "Unhealthy".
	// +optional
	Reason string `json:"reason"`

	// LastTransitionTime is the last time the device health transitioned from one state to another.
	// +required
	LastTransitionTime string `json:"lastTransitionTime"`

	// LastReportedTime is the last reported time for the device health from the driver.
	// +required
	LastReportedTime string `json:"lastReportedTime"`
}
```
Yes, more info like this is necessary for this to be useful. I was hoping to get some feedback on if this high-level approach is worth pursuing or if one of the alternatives listed below is a better place to start getting into more of the details of the API.
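For illustration, a rough sketch of how a driver might maintain such a struct, bumping LastReportedTime on every report but LastTransitionTime only when State actually changes; field types are simplified to strings here, mirroring the suggestion above, and none of this is settled API:

```go
package main

import (
	"fmt"
	"time"
)

type DeviceHealthState string

// DeviceHealthStatus mirrors the suggested shape above, with times as RFC 3339 strings.
type DeviceHealthStatus struct {
	State              DeviceHealthState
	Reason             string
	LastTransitionTime string
	LastReportedTime   string
}

// observe folds a new driver observation into the existing status.
func observe(cur DeviceHealthStatus, state DeviceHealthState, reason string, now time.Time) DeviceHealthStatus {
	ts := now.UTC().Format(time.RFC3339)
	if state != cur.State {
		// The state changed, so record the transition.
		cur.State = state
		cur.Reason = reason
		cur.LastTransitionTime = ts
	}
	// Every report bumps LastReportedTime, even without a transition.
	cur.LastReportedTime = ts
	return cur
}

func main() {
	s := observe(DeviceHealthStatus{}, "Healthy", "", time.Now())
	s = observe(s, "Unhealthy", "ECCErrors", time.Now())
	fmt.Printf("%+v\n", s)
}
```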
Add TODO to fill out API
```go
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
	// ...
```
The dra-health/v1alpha1 gRPC service implemented in #130606 already provides a stream of health updates from the DRA plugin to the Kubelet. This same gRPC service could be leveraged as the source of truth for populating this new ResourceSlice.status field.
++
PR kubernetes/kubernetes#130606 introduced:

```go
type DeviceHealthStatus string

const (
	// DeviceHealthStatusHealthy represents a healthy device.
	DeviceHealthStatusHealthy DeviceHealthStatus = "Healthy"
	// DeviceHealthStatusUnhealthy represents an unhealthy device.
	DeviceHealthStatusUnhealthy DeviceHealthStatus = "Unhealthy"
	// DeviceHealthStatusUnknown represents a device with unknown health status.
	DeviceHealthStatusUnknown DeviceHealthStatus = "Unknown"
)
```
@Jpsassine Does that status only surface for devices that are currently allocated to a Pod? If an unallocated device becomes unhealthy, is that visible anywhere in the Kubernetes API?
@nojnhuh Yes, so the health status from my PR surfaces health only for devices that are currently allocated to a Pod, which is reported via the new pod.status.containerStatuses.allocatedResourcesStatus field.
However, it seems KEP-5283 could exactly address this visibility gap by adding the health status of all the devices to the ResourceSlice.
Regardless of what we surface today, the DRA plugins that implement the new gRPC service DRAResourceHealth will be streaming the health of all devices associated with them.
@Jpsassine How might we expose the health of devices that are accessible from multiple Nodes, like network attached devices? Does the kubelet on each Node compute the health of the device separately? Is it possible that two Nodes might have differing opinions on the health of the same device? I'm wondering if this KEP would need to define a way to express the health of a device with respect to each Node that could attach it.
I believe the DRA driver is the source of truth for device health here, not the Kubelet. In the architecture I implemented for KEP-4680, the kubelet acts as a client that consumes health status streamed from the node-local DRA plugin via the DRAResourceHealth gRPC service. This design inherently handles the possibility of differing health perspectives between nodes (although I don't see how there could be a legitimate discrepancy in the same device's health between nodes). Since a ResourceSlice is published by the DRA driver running on a specific node, the health status it contains would naturally reflect the device's condition from that node's perspective.
Example assuming the device healths are used to populate ResourceSlice device statuses:
- If a network-attached device is experiencing issues from Node A, the DRA driver on Node A would report it as Unhealthy in the ResourceSlice for that node.
- Simultaneously, if the same device is accessible from Node B, the driver on Node B would report it as Healthy in its ResourceSlice.

Although I think this would be odd, it shows that the current model should account for this scenario where one node reports the same device as healthy and another as unhealthy.
@SergeyKanzhelev, please correct me if I am wrong, but to the best of my understanding this is how device health works with DRA now.
Sorry, to clarify, I meant that the kubelet would "compute the health of the device" by invoking the DRA driver. In that way, could the place where the kubelet currently updates Pod status be extended to also update the ResourceSlice status? Is something like that what you had in mind?

> - If a network-attached device is experiencing issues from Node A, the DRA driver on Node A would report it as Unhealthy in the ResourceSlice for that node.
> - Simultaneously, if the same device is accessible from Node B, the driver on Node B would report it as Healthy in its ResourceSlice.

I would normally expect that device in this case to be represented in only one ResourceSlice which contains a nodeSelector matching multiple Nodes, or allNodes, instead of a singular nodeName, so a single "Healthy" signal for a device like that couldn't capture the entire context that the device is currently accessible from one Node but not another.
Storing the status with respect to each Node for each device will get costly though in large clusters where many devices are accessible from many different Nodes. I suppose we could consider a device identified as "unhealthy" by any DRA driver instance to be "unhealthy" overall (one such aggregation rule is sketched below), but then I'm not sure who or what should be responsible for aggregating all those results, determining the final status, and updating the ResourceSlice if each kubelet can't do that by itself.
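One possible aggregation rule for multi-node devices, sketched below as a hypothetical illustration: any node reporting Unhealthy wins, then Unknown, then Healthy. Which component would run this is exactly the open question above.

```go
package main

import "fmt"

type DeviceHealthState string

const (
	Healthy   DeviceHealthState = "Healthy"
	Unhealthy DeviceHealthState = "Unhealthy"
	Unknown   DeviceHealthState = "Unknown"
)

// aggregate collapses per-node health opinions about a single device into one
// overall state, pessimistically: any Unhealthy report dominates, then Unknown.
func aggregate(perNode map[string]DeviceHealthState) DeviceHealthState {
	if len(perNode) == 0 {
		return Unknown // no reports at all
	}
	result := Healthy
	for _, s := range perNode {
		switch s {
		case Unhealthy:
			return Unhealthy
		case Unknown:
			result = Unknown
		}
	}
	return result
}

func main() {
	// Node A sees link issues; Node B can still reach the device.
	fmt.Println(aggregate(map[string]DeviceHealthState{
		"node-a": Unhealthy,
		"node-b": Healthy,
	}))
}
```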
Making kubelet responsible for updating ResourceSlice status is not going to work for network-attached devices because there is no single kubelet instance which is responsible for those.
You both are right. Making the kubelet responsible for updating ResourceSlice.status won't work for network-attached devices or multi-node devices. My work in KEP-4680 was intentionally focused on the node-level problem of exposing the health of a device actively in use by a Pod via the PodStatus.
For the broader goal in KEP-5283, since we want to expose the health of unallocated and allocated devices, the responsibility should be on a cluster-level component like the DRA driver's controller.
How I see it (a rough sketch follows this list):
- Node-level DRA drivers stream health for all of their devices via the new DRAResourceHealth gRPC service from KEP-4680.
- The kubelet consumes this stream only to update the PodStatus of its local Pods.
- The DRA driver's central controller aggregates these streams from all the nodes and is responsible for writing the health state to ResourceSlice.status.

Essentially, only leverage the new gRPC stream and the health data flowing in from the DRA drivers, not the kubelet's updating of status.
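A very rough sketch of that flow, under the assumption that per-node streams are relayed to the central controller as simple update messages; the HealthUpdate type is invented here, a Go channel stands in for the gRPC transport, and the ResourceSlice write is stubbed out:

```go
package main

import "fmt"

// HealthUpdate is a hypothetical message a node-level plugin might emit on the
// DRAResourceHealth stream: which device, seen from which node, in what state.
type HealthUpdate struct {
	Node, Pool, Device, State string
}

// runAggregator consumes updates from all nodes and records the merged view.
// The write is stubbed; a real controller would PATCH resourceslices/status
// via the API server.
func runAggregator(updates <-chan HealthUpdate) {
	// (pool, device) -> node -> last reported state
	latest := map[[2]string]map[string]string{}
	for u := range updates {
		key := [2]string{u.Pool, u.Device}
		if latest[key] == nil {
			latest[key] = map[string]string{}
		}
		latest[key][u.Node] = u.State
		writeResourceSliceStatus(key, latest[key])
	}
}

func writeResourceSliceStatus(device [2]string, perNode map[string]string) {
	fmt.Printf("would update status for %v: %v\n", device, perNode)
}

func main() {
	ch := make(chan HealthUpdate, 2)
	ch <- HealthUpdate{"node-a", "pool-1", "gpu-0", "Unhealthy"}
	ch <- HealthUpdate{"node-b", "pool-1", "gpu-0", "Healthy"}
	close(ch)
	runAggregator(ch)
}
```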
some action to restore the device to a healthy state. This KEP defines a standard way to determine whether or not a device is considered healthy.
This sounds like a bigger scope than the KEP title.

Suggested change:
Some action is required to restore the device to a healthy state. This KEP proposes a new entry in the ResourceSlice object to allow DRA drivers to report whether or not a device is considered healthy.
I think I would prefer to change the title then if this statement is too far off from it.
@johnbelamaric Would it be appropriate to retitle #5283 to something like "DRA: Device Health Status" that doesn't imply any particular solution like adding a new status field to ResourceSlice, but is specific enough to differentiate it from #4680?
The "summary" section condenses the KEP, so it the proposal is to extend ResourceSlice status then it's important to mention here because it clarifies what readers should expect when diving into the details.
But the "motivation" section shouldn't include that yet because it is an implementation choice. There might be other ways to solve the problems described there.
Thanks, updated the summary to mention the high-level approach.
- Define what constitutes a "healthy" or "unhealthy" device. That distinction is made by each DRA driver.
This non-goal collides with what you said in the summary section, line 184.
This non-goal is only saying that Kubernetes doesn't care about the underlying characteristics of a device that cause a driver to consider it healthy or not. The summary says cluster administrators are interested in identifying and remediating unhealthy devices. Are those at odds with each other?
IMHO those clearly conflict. To remediate, one needs to know what the specific issues and their root causes are.
This KEP describes where health information can be found and its general structure. DRA drivers populate that health information in the API. Cluster admins use that to help identify and remediate issues.
I don't see where any of that conflicts?
So you're saying that, in addition to the non-standard information the admin requires to actually do something about the health issue, there would be a standard health flag, which the admin would monitor to see whether there's a need to look further?
This immediately raises the question of who then decides and configures which conditions trigger such a flag.
Because if the flag is raised on things that are irrelevant to the admin, or it's not raised on things that the admin cares about, it's not really helping; the admin would need to follow the non-standard info anyway.
I agree that this is problematic. Either this KEP limits itself to just defining fields that can be used in a vendor-specific way or it attaches additional semantic to those fields which then must be followed by all drivers. There are pros and cons for both.
Based on these non-goals, the KEP seems to be in the "no semantic" camp. That raises the question whether device attributes would be sufficient, perhaps combined with "admin-controlled device attributes".
@pohly How would this (semantically) interact with https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4817-resource-claim-device-status/README.md?
At first glance I don't see the connection. Device health needs to be reported whether a device is allocated or not. The ResourceClaim status is about providing additional information related to the allocation (for example, the assigned IP in the case of a network interface). I would not use it for the overall device health, although the "raw data" escape hatch in the status would allow that.
I agree that attributes, taints, or metrics seem like a better place to put vendor-specific health information since those are more easily namespaced with qualified domain names than a generic field in the API. Vendors can express more relevant status that's more actionable by admins that way.
I'm still not sure how feasible it is to prescribe any common meaningful semantic to a single overall "healthy" or "unhealthy" signal. Different vendors will probably have mostly different ways of remediating unhealthy devices, so some vendor-specific info will likely be needed anyway to resolve issues.
For now, I'm leaning toward recommending something like the attributes/taints or metrics approaches to encourage vendors and cluster admins to innovate in this area (a sketch of the attributes flavor follows). That obviously shifts the burden onto users in the short term, but hearing about a few different ways to handle health using those existing options might make common semantics, which could then be more strictly defined in the API, more obvious. Then again, defining some alpha API that gets scrapped or redefined in a few months doesn't seem like a disaster either in case we get it wrong the first time here.
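A sketch of the attributes flavor, using simplified stand-ins for the real resource.k8s.io types and invented, vendor-namespaced attribute names:

```go
package main

import "fmt"

// Simplified stand-ins for the resource.k8s.io device types.
type DeviceAttribute struct {
	BoolValue   *bool
	StringValue *string
}

type Device struct {
	Name       string
	Attributes map[string]DeviceAttribute
}

func main() {
	healthy := false
	reason := "ECCErrors"
	// A vendor could publish health as ordinary, namespaced device attributes
	// with no new API surface; consumers would need to know the vendor's names.
	d := Device{
		Name: "gpu-0",
		Attributes: map[string]DeviceAttribute{
			"health.vendor.example.com/healthy": {BoolValue: &healthy},
			"health.vendor.example.com/reason":  {StringValue: &reason},
		},
	}
	fmt.Printf("%s healthy=%v\n", d.Name, *d.Attributes["health.vendor.example.com/healthy"].BoolValue)
}
```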
Add alternative for vendor-provided metrics
This is still technically "in-progress" in that it's not ready to merge right now, but I'm ready for early feedback on what's there now to help me fill out the rest of the KEP. Removing "WIP:" from the title.
The main cost of that flexibility is the lack of standardization, where cluster administrators have to track down from each vendor how to determine if a given device is in a healthy state, as opposed to inspecting a well-defined area of a vendor-agnostic API like ResourceSlice. This lack of standardization also makes integrations like generic controllers that automatically taint unhealthy devices less straightforward to implement.
There is an OpenTelemetry standard for the metrics: https://opentelemetry.io/docs/specs/semconv/
(One of the goals of that standardization is providing e.g. drill-down support from whole-node power usage to power usage of individual components inside that.)
Admittedly it's still rather WIP in regards to health-related device metrics: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/
See my list above and e.g.:
- hw.host.power/energy versus hw.power/energy metrics open-telemetry/semantic-conventions#1055
- Issues with Hardware Metrics semantic conventions open-telemetry/semantic-conventions#940

Device telemetry stacks provided by the vendors most likely haven't adopted it yet either...
Even with a standard way to determine certain values like fan speed or battery level, vendors need to document what those mean with respect to how healthy a device is. I think that's an acceptable way to consider implementing this KEP, but it is a step down in some ways from including an overall "healthy"/"unhealthy" signal that could be identical for every kind of device.
Some information / metrics can be rather self-evident (e.g. fan failed). As to the rest of the metrics, you may have a somewhat optimistic view of how much vendors (k8s driver developers) know of their health impact.
How a given set of (less obvious) metrics maps to the long-term health of a given device, and at what probability over what time interval, is information that's more likely to be in the possession of large cluster operators and their admins.
(HW vendors do not constantly run production workloads in huge clusters and collect metrics & health statistics on how they behave; their customers do that, and I suspect they're unlikely to share that info with anybody, even their HW vendor, except to fix specific issues, maybe just for a specific team / person.)
Add user story for purely informational status
Simplify wording in metrics alternative
Add KEP-4680 as related
Add examples to Motivation
Clarify who interprets device metrics
```go
// Contains the status observed by the driver.
// +optional
Status ResourceSliceStatus `json:"status,omitempty"`
```
"By the driver" makes me wonder: is the expectation that this information is always going to come from the driver? It doesn't have to, it could also be a separate component which monitors the health.
This has implications for the API design. If it's always the driver, then including this information in the spec is better. We might even do it via standardized attributes in that case, which would completely remove the need for API changes.
Overall I am a bit weary about putting more information into ResourceSlice if it doesn't absolutely need to be there. It's already challenging to keep the maximum size of it within the required bounds.
Referencing the device is a bit simpler in the ResourceSlice (just needs the name) and it's easier to use in some way (just dump one slice), but it also makes it harder for an admin to actually find unhealthy devices. That would be easier with a dedicated DeviceHealth
type where the health information is in the spec, potentially with field filters defined to support server-side filtering in a LIST operation. With one device per DeviceHealth
there are no concerns about how big that object then becomes and there are no scale limits imposed by the API.
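For illustration, a rough sketch of such a one-device-per-object type; everything here is hypothetical, with names and fields invented:

```go
package main

import "fmt"

// DeviceHealth is a hypothetical cluster-scoped object holding the health of
// exactly one device, so LIST plus field selectors could find unhealthy
// devices without growing ResourceSlice.
type DeviceHealth struct {
	Name string // e.g. "<driver>-<pool>-<device>"
	Spec DeviceHealthSpec
}

type DeviceHealthSpec struct {
	Driver string
	Pool   string
	Device string
	State  string // "Healthy", "Unhealthy", "Unknown"
	Reason string
}

func main() {
	dh := DeviceHealth{
		Name: "gpu.example.com-pool-1-gpu-0",
		Spec: DeviceHealthSpec{
			Driver: "gpu.example.com",
			Pool:   "pool-1",
			Device: "gpu-0",
			State:  "Unhealthy",
			Reason: "ECCErrors",
		},
	}
	fmt.Printf("%+v\n", dh)
}
```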
Another potential alternative: extend DeviceTaint with a "NoEffect" effect and add health information fields there.
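If that direction were explored, a hypothetical shape could look like the sketch below; the DeviceTaint fields are approximated from the existing device-taints design, and "NoEffect" is the new value being floated:

```go
package main

import "fmt"

// Approximation of a device taint; the real type lives in resource.k8s.io.
type DeviceTaint struct {
	Key    string
	Value  string
	Effect string // existing: NoSchedule, NoExecute; hypothetically also NoEffect
}

func main() {
	// A purely informational taint: it describes a health problem but, with
	// Effect "NoEffect", the scheduler would ignore it entirely.
	t := DeviceTaint{
		Key:    "health.vendor.example.com/ecc-errors",
		Value:  "recurring",
		Effect: "NoEffect",
	}
	fmt.Printf("%+v\n", t)
}
```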
"By the driver" makes me wonder: is the expectation that this information is always going to come from the driver? It doesn't have to, it could also be a separate component which monitors the health.
I think I was mostly mirroring the language from ResourceSliceSpec
here, but I agree that enforcing where the status comes from doesn't seem necessary. I've updated this to try to reflect that.
Also added a new DeviceHealth resource as an alternative for now, but I think I prefer that option over status
for ResourceSlice. Depending on whether we can come up with some vendor-agnostic semantics for health information, I'll update which approach is the proposed one.
Also added a new alternative for "NoEffect" taints.
Describe high-level approach in Summary
Reword goal
Add "New DeviceHealth Resource" alternative
Remove implication that health always comes from a DRA driver
Fix indentation in Standardized Attributes alternative
/cc

@guptaNswati: GitHub didn't allow me to request PR reviews from the following users: guptaNswati. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. (In response to the /cc above.)
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Add alternative "Standardized Taints with New 'NoEffect' effect"
Update TOC
`NoSchedule` or `NoExecute` side effect. A standard representation of device health in the ResourceSlice API that is strictly informational enables the widest variety of potential integrations (e.g. Node Problem Detector, custom controllers, dashboards) which may implement custom mitigation strategies that [...]
Would the scheduler do anything with the health information? Or is it solely custom components that create the integration? IMO we should add health information, but only if the DRA components internally react to it. If we want to expose health information that only custom components use, then doesn't it make more sense to use the opaque status?
No, this health status is designed to be purely informative. In #5283 (comment) @johnbelamaric described this as "separating the 'facts' from the 'response to the facts'". The idea here is that we define where the health status lives and what it looks like, to keep every vendor or tool from defining that their own way, even if the Kubernetes control plane itself doesn't act on it.
/cc @johnbelamaric
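As a sketch of what a strictly informational status could enable, the hypothetical consumer below (none of these type names are real API) reacts to ResourceSlice status updates entirely outside the control plane:

```go
package main

import "fmt"

// ResourceSliceStatus is a hypothetical mirror of the proposed status:
// device name -> health state.
type ResourceSliceStatus struct {
	Devices map[string]string
}

// onSliceUpdate is what a purely informational consumer (dashboard, alerter,
// custom tainting controller) might do on each status update; the control
// plane itself would not act on the data.
func onSliceUpdate(slice string, status ResourceSliceStatus) {
	for dev, state := range status.Devices {
		if state != "Healthy" {
			fmt.Printf("ALERT: device %s in slice %s is %s\n", dev, slice, state)
		}
	}
}

func main() {
	onSliceUpdate("node-a-gpu-slice", ResourceSliceStatus{
		Devices: map[string]string{"gpu-0": "Unhealthy", "gpu-1": "Healthy"},
	})
}
```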