KEP-5283: DRA: ResourceSlice Status for Device Health Tracking #5469
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: nojnhuh. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
This is far from complete, but I'd like some feedback on the Summary and Motivation sections to make sure the problem is scoped appropriately. I'd also like some help figuring out if the high-level ideas in the "Design Details" section are worth pursuing further or if one of the Alternatives seems like a better place to start.
#### Enabling Automated Remediation

As a cluster administrator, I want to determine how to remediate unhealthy devices when different failure modes require different methods of remediation.
I’m wondering how DeviceUnhealthy would be consumed for this user story.
Don’t we need some kind of unhealthy reason or device conditions for this?
Yes, definitely. I was hoping to get some consensus on which high-level approach to start with before defining the entire API though.
Before one can determine how to remediate an issue, one needs to know what the issues are.
I think both of those are rather device-specific, but it may be possible to come up with some broad categories, and how those might be remediated (a rough sketch of such a mapping follows this list):
- Device running too hot / fan not working
  - Taint device to reduce its load
  - Notify admin to check cooling
- Device memory (ECC) errors
  - If recurring non-recoverable ones, taint device and notify admin to check/replace memory
- Device power delivery issues
  - Taint device to reduce its load
  - Notify admin to check PSU
- Device power usage throttling
  - If frequent, taint device to reduce its load and notify admin to check device FW power limits
- Overuse of shared device or workload OOMs
  - Taint device to reduce its load
  - If recurring frequently, notify admin to check workload resource requests
- Device link quality / stability issues
  - Prefer devices with better link quality => resource request should specify required link BW
  - If severe enough, ban multi-device workloads and notify admin to investigate
- Specific workload hangs / increased user-space driver error counters
  - Stop scheduling that workload (it may use a buggy user-space driver or use it incorrectly)
  - Alert admin / dev to investigate that workload
- Old / buggy device FW
  - If some workloads work correctly with that FW and others do not, use taints
  - Schedule FW upgrade, and taint device during upgrade
- Device hangs / increased kernel driver / FW / HW error counters
  - Reset specific device part (e.g. compute)
  - Drain device and reset it
  - With too many device resets / error increases, taint device and alert admin
  - Drain all devices on same bus and reset bus
  - Drain whole node and reset it
  - Schedule device firmware update
  - Schedule device replacement
  - (First ones can be done by the (kernel) driver automatically, last ones require admin)
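To make the taxonomy above concrete, here is a minimal, hypothetical sketch of how a remediation controller might map broad failure categories to actions; the category names and actions are invented for illustration and are not part of the KEP:

```go
package main

import "fmt"

// FailureCategory is a hypothetical coarse-grained classification of device
// failures, loosely following the categories listed above.
type FailureCategory string

const (
	Thermal       FailureCategory = "Thermal"       // running too hot / fan failure
	MemoryErrors  FailureCategory = "MemoryErrors"  // recurring ECC errors
	PowerDelivery FailureCategory = "PowerDelivery" // PSU / power delivery issues
	LinkQuality   FailureCategory = "LinkQuality"   // link stability problems
	DeviceHang    FailureCategory = "DeviceHang"    // kernel driver / FW / HW errors
)

// remediate returns the actions a controller might take for a category.
// Real remediations would be driver- and vendor-specific.
func remediate(c FailureCategory) []string {
	switch c {
	case Thermal, PowerDelivery:
		return []string{"taint device to reduce load", "notify admin"}
	case MemoryErrors:
		return []string{"taint device", "notify admin to check/replace memory"}
	case LinkQuality:
		return []string{"prefer devices with better link quality"}
	case DeviceHang:
		return []string{"drain device", "reset device", "alert admin if resets recur"}
	default:
		return []string{"notify admin to investigate"}
	}
}

func main() {
	for _, c := range []FailureCategory{Thermal, DeviceHang} {
		fmt.Println(c, "->", remediate(c))
	}
}
```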
```go
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
	// State is the overall health status of a device.
	//
	// +required
	State DeviceHealthState `json:"state"`
}
```
How about introducing a bit more info? I think we can borrow several fields from PodCondition. For example:

Suggested change:

```go
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
	// State is the overall health status of a device.
	//
	// +required
	State DeviceHealthState `json:"state"`

	// Reason is the reason for this device health. It could be helpful especially when the state is "Unhealthy".
	// +optional
	Reason string `json:"reason"`

	// LastTransitionTime is the last time the device health transitioned from one state to another.
	// +required
	LastTransitionTime string `json:"lastTransitionTime"`

	// LastReportedTime is the last reported time for the device health from the driver.
	// +required
	LastReportedTime string `json:"lastReportedTime"`
}
```
Yes, more info like this is necessary for this to be useful. I was hoping to get some feedback on if this high-level approach is worth pursuing or if one of the alternatives listed below is a better place to start getting into more of the details of the API.
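For illustration, a rough sketch of how a driver might maintain such a struct, bumping LastReportedTime on every report but LastTransitionTime only when State actually changes; field types are simplified to strings here, mirroring the suggestion above, and none of this is settled API:

```go
package main

import (
	"fmt"
	"time"
)

type DeviceHealthState string

// DeviceHealthStatus mirrors the suggested shape above, with times as RFC 3339 strings.
type DeviceHealthStatus struct {
	State              DeviceHealthState
	Reason             string
	LastTransitionTime string
	LastReportedTime   string
}

// observe folds a new driver observation into the existing status.
func observe(cur DeviceHealthStatus, state DeviceHealthState, reason string, now time.Time) DeviceHealthStatus {
	ts := now.UTC().Format(time.RFC3339)
	if state != cur.State {
		// The state changed, so record the transition.
		cur.State = state
		cur.Reason = reason
		cur.LastTransitionTime = ts
	}
	// Every report bumps LastReportedTime, even without a transition.
	cur.LastReportedTime = ts
	return cur
}

func main() {
	s := observe(DeviceHealthStatus{}, "Healthy", "", time.Now())
	s = observe(s, "Unhealthy", "ECCErrors", time.Now())
	fmt.Printf("%+v\n", s)
}
```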
Add TODO to fill out API
```go
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
	// ...
```
The dra-health/v1alpha1 gRPC service implemented in #130606 already provides a stream of health updates from the DRA plugin to the Kubelet. This same gRPC service could be leveraged as the source of truth for populating this new ResourceSlice.status field.
++
PR kubernetes/kubernetes#130606 introduced:

```go
type DeviceHealthStatus string

const (
	// DeviceHealthStatusHealthy represents a healthy device.
	DeviceHealthStatusHealthy DeviceHealthStatus = "Healthy"
	// DeviceHealthStatusUnhealthy represents an unhealthy device.
	DeviceHealthStatusUnhealthy DeviceHealthStatus = "Unhealthy"
	// DeviceHealthStatusUnknown represents a device with unknown health status.
	DeviceHealthStatusUnknown DeviceHealthStatus = "Unknown"
)
```
@Jpsassine Does that status only surface for devices that are currently allocated to a Pod? If an unallocated device becomes unhealthy, is that visible anywhere in the Kubernetes API?
@nojnhuh Yes, so the health status from my PR surfaces health only for devices that are currently allocated to a Pod, which is reported via the new pod.status.containerStatuses.allocatedResourcesStatus field.
However, it seems KEP-5283 could exactly address this visibility gap by adding the health status of all the devices to the ResourceSlice.
Regardless of what we surface today, the DRA plugins that implement the new gRPC service DRAResourceHealth will be streaming the health of all devices associated with them.
@Jpsassine How might we expose the health of devices that are accessible from multiple Nodes, like network attached devices? Does the kubelet on each Node compute the health of the device separately? Is it possible that two Nodes might have differing opinions on the health of the same device? I'm wondering if this KEP would need to define a way to express the health of a device with respect to each Node that could attach it.
I believe the DRA driver is the source of truth for device health here, not the Kubelet. In the architecture I implemented for KEP-4680, the kubelet acts as a client that consumes health status streamed from the node-local DRA plugin via the DRAResourceHealth gRPC service. This design inherently handles the possibility of differing health perspectives between nodes (although I don't see how there could be a legitimate discrepancy in the same device's health between nodes). Since a ResourceSlice is published by the DRA driver running on a specific node, the health status it contains would naturally reflect the device's condition from that node's perspective.
Example assuming the device healths are used to populate ResourceSlice device statuses:
- If a network-attached device is experiencing issues from Node A, the DRA driver on Node A would report it as Unhealthy in the ResourceSlice for that node.
- Simultaneously, if the same device is accessible from Node B, the driver on Node B would report it as Healthy in its ResourceSlice.

Although I think this would be odd, it shows that the current model should account for this scenario where one node reports the same device as healthy and another as unhealthy.
@SergeyKanzhelev, please correct me if I am wrong, but to the best of my understanding this is how device health works with DRA now.
Sorry, to clarify, I meant that the kubelet would "compute the health of the device" by invoking the DRA driver. In that way, could the place where the kubelet currently updates Pod status be extended to also update the ResourceSlice status? Is something like that what you had in mind?

> - If a network-attached device is experiencing issues from Node A, the DRA driver on Node A would report it as Unhealthy in the ResourceSlice for that node.
> - Simultaneously, if the same device is accessible from Node B, the driver on Node B would report it as Healthy in its ResourceSlice.

I would normally expect that device in this case to be represented in only one ResourceSlice which contains a nodeSelector matching multiple Nodes, or allNodes, instead of a singular nodeName, so a single "Healthy" signal for a device like that couldn't capture the entire context that the device is currently accessible from one Node but not another.
Storing the status with respect to each Node for each device will get costly though in large clusters where many devices are accessible from many different Nodes. I suppose we could consider a device identified as "unhealthy" by any DRA driver instance to be "unhealthy" overall (one such aggregation rule is sketched below), but then I'm not sure who or what should be responsible for aggregating all those results, determining the final status, and updating the ResourceSlice if each kubelet can't do that by itself.
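One possible aggregation rule for multi-node devices, sketched below as a hypothetical illustration: any node reporting Unhealthy wins, then Unknown, then Healthy. Which component would run this is exactly the open question above.

```go
package main

import "fmt"

type DeviceHealthState string

const (
	Healthy   DeviceHealthState = "Healthy"
	Unhealthy DeviceHealthState = "Unhealthy"
	Unknown   DeviceHealthState = "Unknown"
)

// aggregate collapses per-node health opinions about a single device into one
// overall state, pessimistically: any Unhealthy report dominates, then Unknown.
func aggregate(perNode map[string]DeviceHealthState) DeviceHealthState {
	if len(perNode) == 0 {
		return Unknown // no reports at all
	}
	result := Healthy
	for _, s := range perNode {
		switch s {
		case Unhealthy:
			return Unhealthy
		case Unknown:
			result = Unknown
		}
	}
	return result
}

func main() {
	// Node A sees link issues; Node B can still reach the device.
	fmt.Println(aggregate(map[string]DeviceHealthState{
		"node-a": Unhealthy,
		"node-b": Healthy,
	}))
}
```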
Making kubelet responsible for updating ResourceSlice status is not going to work for network-attached devices because there is no single kubelet instance which is responsible for those.
You both are right. Making the kubelet responsible for updating ResourceSlice.status won't work for network-attached devices or multi-node devices. My work in KEP-4680 was intentionally focused on the node-level problem of exposing the health of a device actively in use by a Pod via the PodStatus.
For the broader goal in KEP-5283, since we want to expose the health of unallocated and allocated devices, the responsibility should be on a cluster-level component like the DRA driver's controller.
How I see it (a rough sketch follows this list):
- Node-level DRA drivers stream health for all of their devices via the new DRAResourceHealth gRPC service from KEP-4680.
- The kubelet consumes this stream only to update the PodStatus of its local Pods.
- The DRA driver's central controller aggregates these streams from all the nodes and is responsible for writing the health state to ResourceSlice.status.

Essentially, only leverage the new gRPC stream and the health data flowing in from the DRA drivers, not the kubelet's updating of status.
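A very rough sketch of that flow, under the assumption that per-node streams are relayed to the central controller as simple update messages; the HealthUpdate type is invented here, a Go channel stands in for the gRPC transport, and the ResourceSlice write is stubbed out:

```go
package main

import "fmt"

// HealthUpdate is a hypothetical message a node-level plugin might emit on the
// DRAResourceHealth stream: which device, seen from which node, in what state.
type HealthUpdate struct {
	Node, Pool, Device, State string
}

// runAggregator consumes updates from all nodes and records the merged view.
// The write is stubbed; a real controller would PATCH resourceslices/status
// via the API server.
func runAggregator(updates <-chan HealthUpdate) {
	// (pool, device) -> node -> last reported state
	latest := map[[2]string]map[string]string{}
	for u := range updates {
		key := [2]string{u.Pool, u.Device}
		if latest[key] == nil {
			latest[key] = map[string]string{}
		}
		latest[key][u.Node] = u.State
		writeResourceSliceStatus(key, latest[key])
	}
}

func writeResourceSliceStatus(device [2]string, perNode map[string]string) {
	fmt.Printf("would update status for %v: %v\n", device, perNode)
}

func main() {
	ch := make(chan HealthUpdate, 2)
	ch <- HealthUpdate{"node-a", "pool-1", "gpu-0", "Unhealthy"}
	ch <- HealthUpdate{"node-b", "pool-1", "gpu-0", "Healthy"}
	close(ch)
	runAggregator(ch)
}
```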
some action to restore the device to a healthy state. This KEP defines a standard way to determine whether or not a device is considered healthy.
This sounds like a bigger scope than the KEP title.

Suggested change:
Some action is required to restore the device to a healthy state. This KEP proposes a new entry in the ResourceSlice object to allow DRA drivers to report whether or not a device is considered healthy.
I think I would prefer to change the title then if this statement is too far off from it.
@johnbelamaric Would it be appropriate to retitle #5283 to something like "DRA: Device Health Status" that doesn't imply any particular solution like adding a new status field to ResourceSlice, but is specific enough to differentiate it from #4680?
The "summary" section condenses the KEP, so it the proposal is to extend ResourceSlice status then it's important to mention here because it clarifies what readers should expect when diving into the details.
But the "motivation" section shouldn't include that yet because it is an implementation choice. There might be other ways to solve the problems described there.
Thanks, updated the summary to mention the high-level approach.
- Define what constitutes a "healthy" or "unhealthy" device. That distinction is made by each DRA driver.
This non-goal collides with what you said in the summary section, line 184.
This non-goal is only saying that Kubernetes doesn't care about the underlying characteristics of a device that cause a driver to consider it healthy or not. The summary says cluster administrators are interested in identifying and remediating unhealthy devices. Are those at odds with each other?
IMHO those clearly conflict. To remediate, one needs to know what the specific issues and their root causes are.
This KEP describes where health information can be found and its general structure. DRA drivers populate that health information in the API. Cluster admins use that to help identify and remediate issues.
I don't see where any of that conflicts?
So you're saying that, in addition to the non-standard information the admin requires to actually do something about the health issue, there would be a standard health flag, which the admin would monitor to see whether there's a need to look further?
This immediately raises the question of who then decides and configures which conditions trigger such a flag.
Because if the flag is raised on things that are irrelevant to the admin, or it's not raised on things that the admin cares about, it's not really helping; the admin would need to follow the non-standard info anyway.
I agree that this is problematic. Either this KEP limits itself to just defining fields that can be used in a vendor-specific way or it attaches additional semantic to those fields which then must be followed by all drivers. There are pros and cons for both.
Based on these non-goals, the KEP seems to be in the "no semantic" camp. That raises the question whether device attributes would be sufficient, perhaps combined with "admin-controlled device attributes".
@pohly How would this (semantically) interact with https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4817-resource-claim-device-status/README.md?
At first glance I don't see the connection. Device health needs to be reported whether a device is allocated or not. The ResourceClaim status is about providing additional information related to the allocation (for example, the assigned IP in the case of a network interface). I would not use it for the overall device health, although the "raw data" escape hatch in the status would allow that.
I agree that attributes, taints, or metrics seem like a better place to put vendor-specific health information since those are more easily namespaced with qualified domain names than a generic field in the API. Vendors can express more relevant status that's more actionable by admins that way.
I'm still not sure how feasible it is to prescribe any common meaningful semantic to a single overall "healthy" or "unhealthy" signal. Different vendors will probably have mostly different ways of remediating unhealthy devices, so some vendor-specific info will likely be needed anyway to resolve issues.
For now, I'm leaning toward recommending something like the attributes/taints or metrics approaches to encourage vendors and cluster admins to innovate in this area (a sketch of the attributes flavor follows). That obviously shifts the burden onto users in the short term, but hearing about a few different ways to handle health using those existing options might make common semantics, which could then be more strictly defined in the API, more obvious. Then again, defining some alpha API that gets scrapped or redefined in a few months doesn't seem like a disaster either in case we get it wrong the first time here.
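A sketch of the attributes flavor, using simplified stand-ins for the real resource.k8s.io types and invented, vendor-namespaced attribute names:

```go
package main

import "fmt"

// Simplified stand-ins for the resource.k8s.io device types.
type DeviceAttribute struct {
	BoolValue   *bool
	StringValue *string
}

type Device struct {
	Name       string
	Attributes map[string]DeviceAttribute
}

func main() {
	healthy := false
	reason := "ECCErrors"
	// A vendor could publish health as ordinary, namespaced device attributes
	// with no new API surface; consumers would need to know the vendor's names.
	d := Device{
		Name: "gpu-0",
		Attributes: map[string]DeviceAttribute{
			"health.vendor.example.com/healthy": {BoolValue: &healthy},
			"health.vendor.example.com/reason":  {StringValue: &reason},
		},
	}
	fmt.Printf("%s healthy=%v\n", d.Name, *d.Attributes["health.vendor.example.com/healthy"].BoolValue)
}
```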
Add alternative for vendor-provided metrics
This is still technically "in-progress" in that it's not ready to merge right now, but I'm ready for early feedback on what's there now to help me fill out the rest of the KEP. Removing "WIP:" from the title.
The main cost of that flexibility is the lack of standardization, where cluster administrators have to track down from each vendor how to determine if a given device is in a healthy state, as opposed to inspecting a well-defined area of a vendor-agnostic API like ResourceSlice. This lack of standardization also makes integrations like generic controllers that automatically taint unhealthy devices less straightforward to implement.
There is an OpenTelemetry standard for the metrics: https://opentelemetry.io/docs/specs/semconv/
(One of the goals of that standardization is providing e.g. drill-down support from whole-node power usage to power usage of individual components inside that.)
Admittedly it's still rather WIP in regards to health-related device metrics: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/
See my list above and e.g.:
- hw.host.power/energy versus hw.power/energy metrics open-telemetry/semantic-conventions#1055
- Issues with Hardware Metrics semantic conventions open-telemetry/semantic-conventions#940

Device telemetry stacks provided by the vendors most likely haven't adopted it yet either...
Even with a standard way to determine certain values like fan speed or battery level, vendors need to document what those mean with respect to how healthy a device is. I think that's an acceptable way to consider implementing this KEP, but it is a step down in some ways from including an overall "healthy"/"unhealthy" signal that could be identical for every kind of device.
Some information / metrics can be rather self-evident (e.g. fan failed). As to the rest of the metrics, you may have a somewhat optimistic view of how much vendors (k8s driver developers) know of their health impact.
How a given set of (less obvious) metrics maps to the long-term health of a given device, and at what probability over what time interval, is information that's more likely to be in the possession of large cluster operators and their admins.
(HW vendors do not constantly run production workloads in huge clusters and collect metrics & health statistics on how they behave; their customers do that, and I suspect they're unlikely to share that info with anybody, even their HW vendor, except to fix specific issues, maybe just for a specific team / person.)
Add user story for purely informational status
Simplify wording in metrics alternative
Add KEP-4680 as related
Add examples to Motivation
Clarify who interprets device metrics
```go
// Contains the status observed by the driver.
// +optional
Status ResourceSliceStatus `json:"status,omitempty"`
```
"By the driver" makes me wonder: is the expectation that this information is always going to come from the driver? It doesn't have to, it could also be a separate component which monitors the health.
This has implications for the API design. If it's always the driver, then including this information in the spec is better. We might even do it via standardized attributes in that case, which would completely remove the need for API changes.
Overall I am a bit weary about putting more information into ResourceSlice if it doesn't absolutely need to be there. It's already challenging to keep the maximum size of it within the required bounds.
Referencing the device is a bit simpler in the ResourceSlice (just needs the name) and it's easier to use in some way (just dump one slice), but it also makes it harder for an admin to actually find unhealthy devices. That would be easier with a dedicated DeviceHealth
type where the health information is in the spec, potentially with field filters defined to support server-side filtering in a LIST operation. With one device per DeviceHealth
there are no concerns about how big that object then becomes and there are no scale limits imposed by the API.
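For illustration, a rough sketch of such a one-device-per-object type; everything here is hypothetical, with names and fields invented:

```go
package main

import "fmt"

// DeviceHealth is a hypothetical cluster-scoped object holding the health of
// exactly one device, so LIST plus field selectors could find unhealthy
// devices without growing ResourceSlice.
type DeviceHealth struct {
	Name string // e.g. "<driver>-<pool>-<device>"
	Spec DeviceHealthSpec
}

type DeviceHealthSpec struct {
	Driver string
	Pool   string
	Device string
	State  string // "Healthy", "Unhealthy", "Unknown"
	Reason string
}

func main() {
	dh := DeviceHealth{
		Name: "gpu.example.com-pool-1-gpu-0",
		Spec: DeviceHealthSpec{
			Driver: "gpu.example.com",
			Pool:   "pool-1",
			Device: "gpu-0",
			State:  "Unhealthy",
			Reason: "ECCErrors",
		},
	}
	fmt.Printf("%+v\n", dh)
}
```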
Another potential alternative: extend DeviceTaint with a "NoEffect" effect and add health information fields there.
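If that direction were explored, a hypothetical shape could look like the sketch below; the DeviceTaint fields are approximated from the existing device-taints design, and "NoEffect" is the new value being floated:

```go
package main

import "fmt"

// Approximation of a device taint; the real type lives in resource.k8s.io.
type DeviceTaint struct {
	Key    string
	Value  string
	Effect string // existing: NoSchedule, NoExecute; hypothetically also NoEffect
}

func main() {
	// A purely informational taint: it describes a health problem but, with
	// Effect "NoEffect", the scheduler would ignore it entirely.
	t := DeviceTaint{
		Key:    "health.vendor.example.com/ecc-errors",
		Value:  "recurring",
		Effect: "NoEffect",
	}
	fmt.Printf("%+v\n", t)
}
```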
"By the driver" makes me wonder: is the expectation that this information is always going to come from the driver? It doesn't have to, it could also be a separate component which monitors the health.
I think I was mostly mirroring the language from ResourceSliceSpec
here, but I agree that enforcing where the status comes from doesn't seem necessary. I've updated this to try to reflect that.
Also added a new DeviceHealth resource as an alternative for now, but I think I prefer that option over status
for ResourceSlice. Depending on whether we can come up with some vendor-agnostic semantics for health information, I'll update which approach is the proposed one.
Also added a new alternative for "NoEffect" taints.
Describe high-level approach in Summary
Reword goal
Add "New DeviceHealth Resource" alternative
Remove implication that health always comes from a DRA driver
Fix indentation in Standardized Attributes alternative
/cc

@guptaNswati: GitHub didn't allow me to request PR reviews from the following users: guptaNswati. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. (In response to the /cc above.)
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Add alternative "Standardized Taints with New 'NoEffect' effect"
Update TOC
`NoSchedule` or `NoExecute` side effect. A standard representation of device health in the ResourceSlice API that is strictly informational enables the widest variety of potential integrations (e.g. Node Problem Detector, custom controllers, dashboards) which may implement custom mitigation strategies that [...]
Would the scheduler do anything with the health information? Or is it solely custom components that create the integration? IMO we should add health information, but only if the DRA components internally react to it. If we want to expose health information that only custom components use, then doesn't it make more sense to use the opaque status?
No, this health status is designed to be purely informative. In #5283 (comment) @johnbelamaric described this as "separating the 'facts' from the 'response to the facts'". The idea here is that we define where the health status lives and what it looks like, to keep every vendor or tool from defining that their own way, even if the Kubernetes control plane itself doesn't act on it.
/cc @johnbelamaric
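As a sketch of what a strictly informational status could enable, the hypothetical consumer below (none of these type names are real API) reacts to ResourceSlice status updates entirely outside the control plane:

```go
package main

import "fmt"

// ResourceSliceStatus is a hypothetical mirror of the proposed status:
// device name -> health state.
type ResourceSliceStatus struct {
	Devices map[string]string
}

// onSliceUpdate is what a purely informational consumer (dashboard, alerter,
// custom tainting controller) might do on each status update; the control
// plane itself would not act on the data.
func onSliceUpdate(slice string, status ResourceSliceStatus) {
	for dev, state := range status.Devices {
		if state != "Healthy" {
			fmt.Printf("ALERT: device %s in slice %s is %s\n", dev, slice, state)
		}
	}
}

func main() {
	onSliceUpdate("node-a-gpu-slice", ResourceSliceStatus{
		Devices: map[string]string{"gpu-0": "Unhealthy", "gpu-1": "Healthy"},
	})
}
```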