adamint
Member

@adamint adamint commented Oct 9, 2024

Description

The proposed changes are

  • give resources a null health status until health reports start to be received
  • give resources a healthy status if they do not have health checks and are running
  • create initial health reports with a null status so that the dashboard immediately knows which health checks will be running and that they have not returned data yet (make HealthReportSnapshot.Status nullable)
  • remove the "Waiting for data..." text and immediately render the grid of health reports, with an indeterminate loading circle while we wait for health check data
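
As a rough illustration of the first three points, here is a minimal sketch. The `HealthStatus` enum, `HealthReport` record, and `InitialHealthSketch` class are hypothetical stand-ins written for this example, not the Aspire types, and the rule shown is only one reading of the description above.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for illustration; not the Aspire types.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

public sealed record HealthReport(string Name, HealthStatus? Status);

public static class InitialHealthSketch
{
    // One placeholder report per registered check, with no status yet,
    // so a UI can list the checks before any of them has returned data.
    public static IReadOnlyList<HealthReport> CreateInitialReports(IEnumerable<string> checkNames)
        => checkNames.Select(name => new HealthReport(name, null)).ToList();

    // Healthy only when the resource is running and has no checks at all;
    // otherwise the status stays unknown (null) until reports arrive.
    public static HealthStatus? InitialStatus(bool isRunning, IReadOnlyList<HealthReport> reports)
        => (isRunning && reports.Count == 0) ? HealthStatus.Healthy : (HealthStatus?)null;
}
```

With this shape, a resource with one registered check starts with a single report whose status is null and an overall status of null, while a running resource with no checks is reported as healthy.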

Animation

@davidfowl @drewnoakes

Fixes #6125

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
    • No. Follow-up changes expected.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
    • No
  • Did you add public API?
    • Yes
      • If yes, did you have an API Review for it?
        • Yes
        • No
      • Did you add <remarks /> and <code /> elements on your triple slash comments?
        • Yes
        • No
    • No
  • Does the change make any security assumptions or guarantees?
    • Yes
      • If yes, have you done a threat model and had a security review?
        • Yes
        • No
    • No
  • Does the change require an update in our Aspire docs?
    • Yes
      • Link to aspire-docs issue:
    • No

@adamint
Member Author

adamint commented Oct 9, 2024

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@davidfowl
Member

I think the tests are failing

Member

@JamesNK JamesNK left a comment

Needs test. Could probably add an assert on health status to an existing test that looks at custom resource snapshot properties after publish.

@dotnet-policy-service dotnet-policy-service bot added the needs-author-action label Oct 9, 2024

@drewnoakes
Member

I'm not sure the approach here is correct. With the changes here, I expect the UI to show "Running (Unhealthy)" for resources that don't have health checks.

In main, a null value is used so that the dashboard's state column shows a resource as unhealthy (i.e. "Running (Unhealthy)") when we are expecting health reports to arrive. The null applies only when a health check exists, and so carries that information.

This distinction was added as a fix for the following scenario:

  • A resource that has a health check starts, but we have no health reports.
    • It appears in the UI as "Running" and green.
  • A health report arrives showing that the resource is unhealthy.
    • It changes to show as "Running (Unhealthy)" with a white icon.
  • The resource becomes healthy.
    • It changes to "Running" and green again.

The null value allows us to prevent that first state, which incorrectly suggests the resource is healthy.
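
The behaviour described for main can be sketched as a small mapping from state and health status to the state column text. This is an illustration under assumptions: `StateTextSketch` and its enum are hypothetical, and the "Running (Degraded)" label is assumed by analogy with the unhealthy case.

```csharp
// Hypothetical stand-in for illustration; not the dashboard's view model.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

public static class StateTextSketch
{
    public static string GetStateText(string state, HealthStatus? health)
    {
        // Health only qualifies the text of a running resource.
        if (state != "Running")
        {
            return state;
        }

        return health switch
        {
            HealthStatus.Healthy => "Running",
            HealthStatus.Degraded => "Running (Degraded)",
            // Unhealthy, or null while health reports are still expected.
            _ => "Running (Unhealthy)",
        };
    }
}
```

Here a running resource whose checks have not reported yet never passes through the plain "Running" state, which is the point of keeping the null value.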

@davidfowl
Member

> I'm not sure the approach here is correct. With the changes here, I expect the UI to show "Running (Unhealthy)" for resources that don't have health checks.

No, it should show Running, not unhealthy. If we don't have health checks then we're in a "we don't know any better and I have to assume the resource is ready" state.

@drewnoakes
Member

> No, it should show Running, not unhealthy. If we don't have health checks then we're in a "we don't know any better and I have to assume the resource is ready" state.

Sorry I wasn't clear. I agree with you, and that's what happens today. I should have written:

> I'm not sure the approach here is correct. With the changes here, I expect the UI to show "Running (Unhealthy)" for resources that have health checks but don't yet have health reports.

@adamint
Member Author

adamint commented Oct 14, 2024

> I think the tests are failing

It is a helix issue

@dotnet-policy-service dotnet-policy-service bot removed the needs-author-action label Oct 14, 2024

@adamint
Member Author

adamint commented Oct 14, 2024

> I'm not sure the approach here is correct. With the changes here, I expect the UI to show "Running (Unhealthy)" for resources that don't have health checks.
>
> In main, a null value is used so that the dashboard's state column shows a resource as unhealthy (i.e. "Running (Unhealthy)") when we are expecting health reports to arrive. The null applies only when a health check exists, and so carries that information.

That's a confusing distinction. It carries the meaning of "no current health status available." This meaning also applies to resources that do not contain health checks, because we do not have a health status available for them. So when resources aren't defaulting to healthy, null can be simplified to "no current health status available" for all resources, regardless of health checks being present.

> A resource that has a health check starts, but we have no health reports. It appears in the UI as "Running" and green.

With the current text, we don't need to make that distinction between

  • has health checks but does not yet have a report
  • does not have health checks

If we do want to indicate that there are health checks present, but that a result has not come back yet, we may want to consider adding a different status, maybe Running (waiting for health check)

@adamint adamint requested a review from mitchdenny as a code owner October 14, 2024 16:24
@adamint
Member Author

adamint commented Oct 14, 2024

> Needs test. Could probably add an assert on health status to an existing test that looks at custom resource snapshot properties after publish.

Could you clarify? I also added a null assert in ResourceNotificationTests but I'm pretty sure you mean something else.

@adamint adamint requested a review from JamesNK October 14, 2024 16:25
@JamesNK
Member

JamesNK commented Oct 14, 2024

> > Needs test. Could probably add an assert on health status to an existing test that looks at custom resource snapshot properties after publish.
>
> Could you clarify? I also added a null assert in ResourceNotificationTests but I'm pretty sure you mean something else.

Write a test for each scenario that people are talking about here and assert the health status. That draws a line in the sand for how each scenario would work.

I see:

  • Resource doesn't have health checks.
  • Resource has health checks but no result yet.
  • Resource has health checks and it has a healthy result.

Maybe others? I didn't look at each comment closely.

private static CustomResourceSnapshot UpdateHealthStatus(IResource resource, CustomResourceSnapshot previousState)
{
    // A resource is also healthy if it has no health check annotations and is in the running state.
    if (!resource.TryGetAnnotationsIncludingAncestorsOfType<HealthCheckAnnotation>(out _) && previousState.State?.Text == KnownResourceStates.Running)

Member

This doesn't need to run every publish, it never changes. Can the caller do this instead?

Member Author

It becomes much cheaper to call if the first condition is previousState.HealthStatus is not HealthStatus.Healthy - would you have concerns about that?

private static CustomResourceSnapshot UpdateHealthStatus(IResource resource, CustomResourceSnapshot previousState)
{
    // A resource is also healthy if it has no health check annotations and is in the running state.
    if (!resource.TryGetAnnotationsIncludingAncestorsOfType<HealthCheckAnnotation>(out _) && previousState.State?.Text == KnownResourceStates.Running)

Member

The new HasAnnotationIncludingAncestorsOfType<T> method in #6357 would make this cheaper to call.

Member Author

It becomes much cheaper to call if the first condition is previousState.HealthStatus is not HealthStatus.Healthy - changed
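
The saving comes from `&&` short-circuiting: once the cheap status comparison fails, the annotation lookup never runs. A minimal sketch of that ordering follows, where the `hasHealthChecks` delegate is a hypothetical stand-in for the annotation lookup and `ShortCircuitSketch` is not part of the PR.

```csharp
using System;

// Hypothetical stand-in for illustration; not the Aspire type.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

public static class ShortCircuitSketch
{
    // hasHealthChecks stands in for the comparatively expensive annotation lookup.
    public static bool ShouldMarkHealthy(HealthStatus? current, string state, Func<bool> hasHealthChecks)
        => current is not HealthStatus.Healthy   // cheap comparison, evaluated first
           && !hasHealthChecks()                 // skipped entirely once the resource is already healthy
           && state == "Running";
}
```

After the first publish that marks the resource healthy, later calls return on the first comparison without invoking the lookup.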

@JamesNK
Member

JamesNK commented Oct 17, 2024

FYI I have some changes to make in the next hour. Nothing too controversial.

In the interest of time, I'm going to add them directly to this branch.

@JamesNK
Member

JamesNK commented Oct 17, 2024

Changes:

  • Moved most of StateColumnDisplay logic into the view model.
  • Always display state tooltip. If there is no custom tooltip then fallback to state text.
  • Moved tests from components to dashboard. The actual control isn't being tested with bUnit (only static methods on the control were), so the tests didn't fit in components. Now that they're on a view model, which is a better place for them, they definitely belong in the dashboard tests.
  • Changed state tests to use a shared localizer. It's a bit more robust because you don't have to declare the translation again. Previously if a formatted param was added or removed, the translation in tests could get out of sync.
  • Treating RuntimeUnhealthy state as stopped produced a sub-optimal tooltip:
    image
    Now there is a tooltip that explains the most common problem (container runtime isn't running) and includes a URL for more info.

Member

@JamesNK JamesNK left a comment

Thumbs up for dashboard changes

@JamesNK
Member

JamesNK commented Oct 17, 2024

I tested more and found a situation that doesn't look right to me.

messaging hasn't reported a health status but the state is Running and the icon is green:

image

I would expect the resource to have a state like Running (Waiting for health) with the same icon as an unhealthy running resource.

@adamint adamint merged commit ebb06d1 into dotnet:main Oct 17, 2024
9 checks passed
@drewnoakes
Member

There's a regression in the status column.

image

The status for the unhealthy and degraded resources should read Running (unhealthy) and Running (degraded) respectively.

@davidfowl
Member

Maybe we can start writing dashboard view model tests

@JamesNK
Member

JamesNK commented Oct 18, 2024

I tested and the problem is in ResourceNotificationService (@davidfowl how dare you blame the dashboard 😋 )

// A resource is also healthy if it has no health check annotations and is in the running state.
if (previousState.HealthStatus is not HealthStatus.Healthy && !resource.TryGetAnnotationsIncludingAncestorsOfType<HealthCheckAnnotation>(out _) && previousState.State?.Text == KnownResourceStates.Running)
{
    return previousState with { HealthStatus = HealthStatus.Healthy };
}

There isn't a HealthCheckAnnotation so it is marked as healthy. The health report with a degraded status isn't taken into account.

I'm guessing this problem is unique to the health checks sandbox test app because the health report is manually specified in the snapshot and isn't driven by the annotation like typical usage.

.WithInitialState(new()
{
    ResourceType = "Test Resource",
    State = "Starting",
    Properties = [],
    HealthReports = [new HealthReportSnapshot($"{name}_check", status, description, exception)]
})

It seems like the snapshot's health reports are the better place to calculate the health status from. But this is the first time I've looked at health checks, so someone with more knowledge here should decide.
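
One way to picture that suggestion is to derive the resource status purely from the reports in the snapshot, independent of any annotation. The sketch below uses hypothetical stand-in types and assumes a "worst report wins" rule and a null result while any report is pending; neither rule is confirmed by this thread.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for illustration; not the Aspire types.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

public sealed record HealthReport(string Name, HealthStatus? Status);

public static class ReportAggregationSketch
{
    // Returns null when there are no reports or any report is still pending;
    // otherwise returns the worst status among the reports.
    public static HealthStatus? FromReports(IReadOnlyList<HealthReport> reports)
    {
        if (reports.Count == 0 || reports.Any(r => r.Status is null))
        {
            return null;
        }

        // Every status is non-null here. The enum is declared worst-first,
        // so Min() picks the worst one.
        return reports.Min(r => r.Status.GetValueOrDefault());
    }
}
```

Under these assumptions, a snapshot carrying a single manually specified report with a degraded status resolves to degraded, regardless of whether a HealthCheckAnnotation exists.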

@drewnoakes
Member

@adamint will be opening a PR shortly

@drewnoakes
Member

#6367

@adamint
Member Author

adamint commented Oct 18, 2024

> I tested and the problem is in ResourceNotificationService (@davidfowl how dare you blame the dashboard 😋 )
>
> // A resource is also healthy if it has no health check annotations and is in the running state.
> if (previousState.HealthStatus is not HealthStatus.Healthy && !resource.TryGetAnnotationsIncludingAncestorsOfType<HealthCheckAnnotation>(out _) && previousState.State?.Text == KnownResourceStates.Running)
> {
>     return previousState with { HealthStatus = HealthStatus.Healthy };
> }
>
> There isn't a HealthCheckAnnotation so it is marked as healthy. The health report with a degraded status isn't taken into account.
>
> I'm guessing this problem is unique to the health checks sandbox test app because the health report is manually specified in the snapshot and isn't driven by the annotation like typical usage.
>
> .WithInitialState(new()
> {
>     ResourceType = "Test Resource",
>     State = "Starting",
>     Properties = [],
>     HealthReports = [new HealthReportSnapshot($"{name}_check", status, description, exception)]
> })
>
> It seems like the snapshot health reports is the better place to calculate the health status from. But this is the first time I've looked at health checks so someone with more knowledge here should decide.

The bug is a little more generalized than just the health check test app. The root cause is that there should be a check to see if a resource has a ResourceSnapshotAnnotation; if it does, we should not attempt to override the health status. This is correct because sans a health check, there is nothing else that could modify that initial HealthStatus - @JamesNK #6367

@JamesNK
Member

JamesNK commented Oct 18, 2024

> This is correct because sans a health check, there is nothing else that could modify that initial HealthStatus

I don't think that's true. ResourceNotificationService is public. Someone could add a health report after creating the resource.

@davidfowl
Member

maybe we make this property internal. The only thing publishing health state would be via the annotations

@github-actions github-actions bot locked and limited conversation to collaborators Nov 17, 2024
Development

Successfully merging this pull request may close these issues.

Resources in a waiting state should not show up as healthy
5 participants