Remove HealthStatus defaulting to healthy when there is no value from a health check #6209

adamint · 2024-10-09T17:33:26Z

Description

The proposed changes are:

- give resources a null health status until health reports start to be received
- give resources a healthy status if they do not have health checks and are running
- create initial health reports with a null status so that the dashboard immediately knows which health checks will be running and that they have not returned data yet (this makes HealthReportSnapshot.Status nullable)
- remove the "Waiting for data..." text and immediately render the grid of health reports, with an indeterminate loading circle while we wait for health check data

@davidfowl @drewnoakes

Fixes #6125

Checklist

Is this feature complete?
- Yes. Ready to ship.
- No. Follow-up changes expected.
Are you including unit tests for the changes and scenario tests if relevant?
- Yes
- No
Did you add public API?
- Yes
  - If yes, did you have an API Review for it?
    - Yes
    - No
  - Did you add <remarks /> and <code /> elements on your triple slash comments?
    - Yes
    - No
- No
Does the change make any security assumptions or guarantees?
- Yes
  - If yes, have you done a threat model and had a security review?
    - Yes
    - No
- No
Does the change require an update in our Aspire docs?
- Yes
  - Link to aspire-docs issue:
- No


src/Aspire.Hosting/ApplicationModel/ResourceNotificationService.cs

src/Aspire.Hosting/ApplicationModel/CustomResourceSnapshot.cs

adamint · 2024-10-09T19:22:56Z

/azp run

azure-pipelines · 2024-10-09T19:23:09Z

Azure Pipelines successfully started running 1 pipeline(s).

davidfowl · 2024-10-09T20:35:46Z

I think the tests are failing

JamesNK

Needs test. Could probably add an assert on health status to an existing test that looks at custom resource snapshot properties after publish.

drewnoakes · 2024-10-10T00:05:05Z

I'm not sure the approach here is correct. With the changes here, I expect the UI to show "Running (Unhealthy)" for resources that don't have health checks.

In main, a null value is used so that the dashboard's state column shows a resource as unhealthy (i.e. "Running (Unhealthy)") when we are expecting health reports to arrive. The null applies only when a health check exists, and so carries that information.

This distinction was added as a fix for the following scenario:

- A resource that has a health check starts, but we have no health reports.
  - It appears in the UI as "Running" and green.
- A health report arrives showing that the resource is unhealthy.
  - It changes to show as "Running (Unhealthy)" with a white icon.
- The resource becomes healthy.
  - It changes to "Running" and green again.

The null value allows us to prevent that first state, which incorrectly suggests the resource is healthy.
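
As a rough illustration of the point above, here is a minimal sketch of how a state column could map a state plus a nullable health status to display text. The enum, class, and method names are hypothetical stand-ins, not the dashboard's actual code, and the sketch assumes that resources without health checks have already been given a Healthy status elsewhere.

```csharp
using System;

// Hypothetical stand-in for the real health status enum.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

public static class StateDisplaySketch
{
    // Returns the text for the state column.
    // Assumption: a null health status on a running resource means
    // "health reports are expected but none has arrived yet".
    public static string GetStateText(string state, HealthStatus? health)
    {
        if (state != "Running")
        {
            // Health only qualifies the Running state in this sketch.
            return state;
        }

        return health switch
        {
            HealthStatus.Healthy => "Running",
            HealthStatus.Degraded => "Running (Degraded)",
            // An explicit Unhealthy report and a missing (null) status both
            // avoid the plain "Running" text that would suggest health.
            _ => "Running (Unhealthy)",
        };
    }
}
```

With this mapping, the first step of the scenario never shows plain "Running", because the null status falls through to the unhealthy branch.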

davidfowl · 2024-10-10T04:51:44Z

I'm not sure the approach here is correct. With the changes here, I expect the UI to show "Running (Unhealthy)" for resources that don't have health checks.

No, it should show Running, not unhealthy. If we don't have health checks then we're in a "we don't know any better and I have to assume the resource is ready" state.

drewnoakes · 2024-10-10T05:55:38Z

No, it should show Running, not unhealthy. If we don't have health checks then we're in a "we don't know any better and I have to assume the resource is ready" state.

Sorry I wasn't clear. I agree with you, and that's what happens today. I should have written:

I'm not sure the approach here is correct. With the changes here, I expect the UI to show "Running (Unhealthy)" for resources that have health checks but don't yet have health reports.

adamint · 2024-10-14T15:43:31Z

I think the tests are failing

It is a Helix issue.

adamint · 2024-10-14T16:06:56Z

I'm not sure the approach here is correct. With the changes here, I expect the UI to show "Running (Unhealthy)" for resources that don't have health checks.

In main, a null value is used so that the dashboard's state column shows a resource as unhealthy (i.e. "Running (Unhealthy)") when we are expecting health reports to arrive. The null applies only when a health check exists, and so carries that information.

That's a confusing distinction. It carries the meaning of "no current health status available." This meaning also applies to resources that do not contain health checks, because we do not have a health status available for them. So when resources aren't defaulting to healthy, null can be simplified to "no current health status available" for all resources, regardless of health checks being present.

A resource that has a health check starts, but we have no health reports. It appears in the UI as "Running" and green.

With the current text, we don't need to make that distinction between:

- has health checks but does not yet have a report
- does not have health checks

If we do want to indicate that there are health checks present, but that a result has not come back yet, we may want to consider adding a different status, maybe Running (waiting for health check).

adamint · 2024-10-14T16:25:12Z

Needs test. Could probably add an assert on health status to an existing test that looks at custom resource snapshot properties after publish.

Could you clarify? I also added a null assert in ResourceNotificationTests but I'm pretty sure you mean something else.

tests/Aspire.Dashboard.Components.Tests/Controls/StateColumnDisplayTests.cs

JamesNK · 2024-10-14T22:59:30Z

Needs test. Could probably add an assert on health status to an existing test that looks at custom resource snapshot properties after publish.

Could you clarify? I also added a null assert in ResourceNotificationTests but I'm pretty sure you mean something else.

Write a test for each scenario that people are talking about here and assert the health status. That draws a line in the sand for how each scenario would work.

I see:

- Resource doesn't have health checks.
- Resource has health checks but no result yet.
- Resource has health checks and it has a healthy result.

Maybe others? I didn't look at each comment closely.
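
The three scenarios listed above could be pinned down roughly as follows. This is only a sketch: Compute is a hypothetical helper that encodes the behavior proposed in the PR description, not the real ResourceNotificationService logic, and a real test would go through the published resource snapshot instead.

```csharp
using System;

// Hypothetical stand-in for the real health status enum.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

public static class HealthScenarioSketch
{
    // Hypothetical helper encoding the proposed rules:
    // - no health checks and running -> Healthy
    // - health checks present        -> whatever the latest result is (null until one arrives)
    public static HealthStatus? Compute(bool hasHealthChecks, bool isRunning, HealthStatus? latestResult)
    {
        if (!hasHealthChecks)
        {
            return isRunning ? HealthStatus.Healthy : (HealthStatus?)null;
        }

        return latestResult;
    }

    // One assertion per scenario from the comment above.
    public static void AssertScenarios()
    {
        Expect(Compute(hasHealthChecks: false, isRunning: true, latestResult: null) == HealthStatus.Healthy,
            "resource doesn't have health checks");
        Expect(Compute(hasHealthChecks: true, isRunning: true, latestResult: null) == null,
            "resource has health checks but no result yet");
        Expect(Compute(hasHealthChecks: true, isRunning: true, latestResult: HealthStatus.Healthy) == HealthStatus.Healthy,
            "resource has health checks and a healthy result");
    }

    private static void Expect(bool condition, string scenario)
    {
        if (!condition)
        {
            throw new InvalidOperationException($"Scenario failed: {scenario}");
        }
    }
}
```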

davidfowl · 2024-10-17T06:39:48Z

src/Aspire.Hosting/ApplicationModel/ResourceNotificationService.cs

+    private static CustomResourceSnapshot UpdateHealthStatus(IResource resource, CustomResourceSnapshot previousState)
+    {
+        // A resource is also healthy if it has no health check annotations and is in the running state.
+        if (!resource.TryGetAnnotationsIncludingAncestorsOfType<HealthCheckAnnotation>(out _) && previousState.State?.Text == KnownResourceStates.Running)


This doesn't need to run every publish, it never changes. Can the caller do this instead?

adamint · 2024-10-17

It becomes much cheaper to call if the first condition is previousState.HealthStatus is not HealthStatus.Healthy. Would you have concerns about that?

drewnoakes · 2024-10-17T12:43:39Z

src/Aspire.Hosting/ApplicationModel/ResourceNotificationService.cs

+    private static CustomResourceSnapshot UpdateHealthStatus(IResource resource, CustomResourceSnapshot previousState)
+    {
+        // A resource is also healthy if it has no health check annotations and is in the running state.
+        if (!resource.TryGetAnnotationsIncludingAncestorsOfType<HealthCheckAnnotation>(out _) && previousState.State?.Text == KnownResourceStates.Running)


The new HasAnnotationIncludingAncestorsOfType<T> method in #6357 would make this cheaper to call.

adamint · 2024-10-17

It becomes much cheaper to call if the first condition is previousState.HealthStatus is not HealthStatus.Healthy

Changed.

JamesNK · 2024-10-17T12:54:57Z

FYI I have some changes to make in the next hour. Nothing too controversial.

In the interest of time, I'm going to add them directly to this branch.

JamesNK · 2024-10-17T14:04:31Z

Changes:

- Moved most of the StateColumnDisplay logic into the view model.
- Always display a state tooltip. If there is no custom tooltip, fall back to the state text.
- Moved tests from components to dashboard. The actual control isn't being tested with bUnit (just static methods on the control), so it didn't fit in components. Now that they're on a view model, a better place for them, they definitely belong in dashboard tests.
- Changed state tests to use a shared localizer. It's a bit more robust because you don't have to declare the translation again. Previously, if a formatted param was added or removed, the translation in tests could get out of sync.
- Treating the RuntimeUnhealthy state as stopped produced a sub-optimal tooltip. Now there is a tooltip that explains the most common problem (container runtime isn't running) and includes a URL for more info.

JamesNK

Thumbs up for dashboard changes

JamesNK · 2024-10-17T14:24:33Z

I tested more and found a situation that doesn't look right to me.

messaging hasn't reported a health status but the state is Running and the icon is green:

I would expect the resource to have a state like Running (Waiting for health) with the same icon as an unhealthy running resource.

drewnoakes · 2024-10-18T02:07:18Z

There's a regression in the status column.

The status for the unhealthy and degraded resources should read Running (unhealthy) and Running (degraded) respectively.

davidfowl · 2024-10-18T02:18:10Z

Maybe we can start writing dashboard view model tests

JamesNK · 2024-10-18T02:41:08Z

I tested and the problem is in ResourceNotificationService (@davidfowl how dare you blame the dashboard 😋 )

aspire/src/Aspire.Hosting/ApplicationModel/ResourceNotificationService.cs

Lines 398 to 402 in f19600f

    // A resource is also healthy if it has no health check annotations and is in the running state.
    if (previousState.HealthStatus is not HealthStatus.Healthy && !resource.TryGetAnnotationsIncludingAncestorsOfType<HealthCheckAnnotation>(out _) && previousState.State?.Text == KnownResourceStates.Running)
    {
        return previousState with { HealthStatus = HealthStatus.Healthy };
    }

There isn't a HealthCheckAnnotation so it is marked as healthy. The health report with a degraded status isn't taken into account.

I'm guessing this problem is unique to the health checks sandbox test app because the health report is manually specified in the snapshot and isn't driven by the annotation like typical usage.

aspire/playground/HealthChecks/HealthChecksSandbox.AppHost/Program.cs

Lines 43 to 49 in f19600f

    .WithInitialState(new()
    {
        ResourceType = "Test Resource",
        State = "Starting",
        Properties = [],
        HealthReports = [new HealthReportSnapshot($"{name}_check", status, description, exception)]
    })

It seems like the snapshot health reports is the better place to calculate the health status from. But this is the first time I've looked at health checks so someone with more knowledge here should decide.
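
One way to read that suggestion is to derive the overall status from the reports carried by the snapshot instead of from the presence of an annotation. The sketch below is an assumption about how that could look, not the fix that was eventually made: ReportSketch is a simplified, hypothetical stand-in for HealthReportSnapshot, and the "worst status wins" rule is an illustrative choice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the real enum; ordered so that lower means worse.
public enum HealthStatus { Unhealthy = 0, Degraded = 1, Healthy = 2 }

// Hypothetical, simplified report; Status is null until the check has returned data.
public sealed record ReportSketch(string Name, HealthStatus? Status);

public static class HealthAggregationSketch
{
    public static HealthStatus? FromReports(IReadOnlyList<ReportSketch> reports, bool isRunning)
    {
        if (reports.Count == 0)
        {
            // Nothing will ever report: assume healthy once the resource is running.
            return isRunning ? HealthStatus.Healthy : (HealthStatus?)null;
        }

        if (reports.Any(r => r.Status is null))
        {
            // At least one check has not returned data yet, so the overall status is unknown.
            return null;
        }

        // All reports have a value here; the worst (lowest) status wins.
        return reports.Min(r => r.Status.GetValueOrDefault());
    }
}
```

Under these assumptions, a snapshot that carries a single degraded report and no annotation would come out as Degraded instead of being overwritten with Healthy.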

drewnoakes · 2024-10-18T02:42:17Z

@adamint will be opening a PR shortly

drewnoakes · 2024-10-18T02:51:39Z

#6367

adamint · 2024-10-18T02:52:42Z

I tested and the problem is in ResourceNotificationService (@davidfowl how dare you blame the dashboard 😋 )

aspire/src/Aspire.Hosting/ApplicationModel/ResourceNotificationService.cs

Lines 398 to 402 in f19600f

    // A resource is also healthy if it has no health check annotations and is in the running state.
    if (previousState.HealthStatus is not HealthStatus.Healthy && !resource.TryGetAnnotationsIncludingAncestorsOfType<HealthCheckAnnotation>(out _) && previousState.State?.Text == KnownResourceStates.Running)
    {
        return previousState with { HealthStatus = HealthStatus.Healthy };
    }

There isn't a HealthCheckAnnotation so it is marked as healthy. The health report with a degraded status isn't taken into account.

I'm guessing this problem is unique to the health checks sandbox test app because the health report is manually specified in the snapshot and isn't driven by the annotation like typical usage.

aspire/playground/HealthChecks/HealthChecksSandbox.AppHost/Program.cs

Lines 43 to 49 in f19600f

    .WithInitialState(new()
    {
        ResourceType = "Test Resource",
        State = "Starting",
        Properties = [],
        HealthReports = [new HealthReportSnapshot($"{name}_check", status, description, exception)]
    })

It seems like the snapshot health reports is the better place to calculate the health status from. But this is the first time I've looked at health checks so someone with more knowledge here should decide.

The bug is a little more generalized than just the health check test app. The root cause is that there should be a check to see if a resource has a ResourceSnapshotAnnotation; if it does, we should not attempt to override the health status. This is correct because sans a health check, there is nothing else that could modify that initial HealthStatus - @JamesNK #6367

JamesNK · 2024-10-18T03:01:29Z

This is correct because sans a health check, there is nothing else that could modify that initial HealthStatus

I don't think that's true. ResourceNotificationService is public. Someone could add a health report after creating the resource.

davidfowl · 2024-10-18T03:25:01Z

maybe we make this property internal. The only thing publishing health state would be via the annotations

Remove HealthStatus defaulting to healthy when there is no value from a health check

150ec05

adamint requested review from davidfowl and drewnoakes October 9, 2024 17:33

adamint commented Oct 9, 2024

src/Aspire.Hosting/ApplicationModel/ResourceNotificationService.cs

davidfowl reviewed Oct 9, 2024

src/Aspire.Hosting/ApplicationModel/CustomResourceSnapshot.cs

build-analysis bot mentioned this pull request Oct 9, 2024

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

3 tasks

davidfowl added this to the 9.0 milestone Oct 9, 2024

JamesNK requested changes Oct 9, 2024


dotnet-policy-service bot added the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 9, 2024

dotnet-policy-service bot removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 14, 2024

Show unhealthy only if health status is not null

c26e8dd

Add null assert to status in ResourceNotificationTests

57f11be

adamint requested a review from mitchdenny as a code owner October 14, 2024 16:24

adamint requested a review from JamesNK October 14, 2024 16:25

Add test for StateColumnDisplay static functions

c66b778

adamint commented Oct 14, 2024

tests/Aspire.Dashboard.Components.Tests/Controls/StateColumnDisplayTests.cs

Adam Ratzman added 2 commits October 14, 2024 16:04

Merge branch 'main' into dev/adamint/remove-default-healthy-status

7fb3fa0

Update with new expectations

a2df9ed

build-analysis bot mentioned this pull request Oct 14, 2024

The active test run was aborted. Reason: Test host process crashed dotnet/dnceng#451

Open

3 tasks

Create initial health reports if we have not received from server

049dcee

adamint mentioned this pull request Oct 17, 2024

Add an analyzer to prevent indexing a localized value instead of member name #6352

Open

davidfowl reviewed Oct 17, 2024


drewnoakes mentioned this pull request Oct 17, 2024

Add ResourceExtensions.HasAnnotation methods #6357

Merged

16 tasks

drewnoakes reviewed Oct 17, 2024


JamesNK added 3 commits October 17, 2024 21:33

Refactor state column display and its tests

c37db7d

Runtime unhealthy improvements

749c0b6

Merge

e168dc9

JamesNK requested review from radical and eerhardt as code owners October 17, 2024 13:56

Fix merge

ac14849

Fix state not changing

0297102

JamesNK approved these changes Oct 17, 2024


Adam Ratzman added 2 commits October 17, 2024 10:57

check at beginning of update health status if health status is not already running

ead2b13

Set state to running (unhealthy) on empty health status

3eacb74

davidfowl approved these changes Oct 17, 2024


Update test expectations

3f23964

adamint merged commit ebb06d1 into dotnet:main Oct 17, 2024
9 checks passed

github-actions bot locked and limited conversation to collaborators Nov 17, 2024

github-actions bot added the area-dashboard label Mar 10, 2025
