Conversation

BenTheElder (Member)

  • One-line PR description: Introduce invariant testing KEP.
  • Other comments: This process has a truly annoying amount of redundancy and boilerplate. We really need to think about how we can streamline things a bit. I spent far more time on this KEP than on finding an agreeable solution to the problem and implementing it.

/cc @pohly @aojea @richabanker

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/testing Categorizes an issue or PR as relevant to SIG Testing. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 5, 2025
@BenTheElder BenTheElder mentioned this pull request Sep 5, 2025
@BenTheElder BenTheElder force-pushed the invariant branch 2 times, most recently from 5d434e5 to 7d568c8 on September 5, 2025 00:47
# worked on.
latest-milestone: "v1.35"

# The milestone at which this feature was, or is targeted to be, at each stage.
Member Author

Will update this pending the above discussion about the alpha/beta/stable lifecycle not really making sense for this KEP: https://github.com/kubernetes/enhancements/pull/5514/files#r2323831978

- sig-api-machinery
status: provisional
creation-date: 2025-09-04
reviewers:
Member Author

maybe @jpbetz ?

richa per #5468 (comment)

patrick and antonio as fellow SIG Testing TLs

@BenTheElder (Member Author)

There is a draft of the implementation in kubernetes/kubernetes#133394, though I intend to capture all necessary details here.

Mostly here we want to formalize what sort of "run after all other tests" is permitted and how we're approaching that.

@pohly (Contributor) commented Sep 5, 2025

/assign

mechanisms to select / skip them.

We can then use `ginkgo.ReportAfterSuite` to inspect which, if any, of these
sentinel invariant tests ran, and run the *actual* test logic, reporting pass/fail.
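
For illustration, here is a minimal Go sketch of the sentinel plus `ginkgo.ReportAfterSuite` pattern described in this excerpt. It is not the implementation from kubernetes/kubernetes#133394; the test name, the `Invariant` label, and the `checkMetricInvariants` helper are hypothetical placeholders.

```go
package invariants

import (
	"testing"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/ginkgo/v2/types"
)

// Hypothetical name; the real test name and labels are defined in the implementation PR.
const sentinelName = "[sig-testing] metrics invariants sentinel"

// The sentinel spec does nothing itself; it exists so the normal label/skip
// selection machinery decides whether invariant checking applies to this run.
var _ = ginkgo.It(sentinelName, ginkgo.Label("Invariant"), func() {})

// After the suite, inspect the report: if the sentinel ran, perform the real
// invariant check and fail the suite when it is violated.
var _ = ginkgo.ReportAfterSuite("metrics invariant check", func(report ginkgo.Report) {
	sentinelRan := false
	for _, spec := range report.SpecReports {
		if spec.FullText() == sentinelName && spec.State == types.SpecStatePassed {
			sentinelRan = true
			break
		}
	}
	if !sentinelRan {
		return // sentinel skipped or filtered out: no invariant checking this run
	}
	if err := checkMetricInvariants(); err != nil {
		ginkgo.Fail("invariant check failed: " + err.Error())
	}
})

// checkMetricInvariants stands in for the real logic (scrape metrics from the
// cluster, validate a small set of them); elided in this sketch.
func checkMetricInvariants() error { return nil }

func TestInvariants(t *testing.T) {
	ginkgo.RunSpecs(t, "invariant sentinel sketch")
}
```

With this shape, the existing selection mechanisms (ginkgo label filters or skip patterns) that exclude the sentinel spec also disable the after-suite check, which is the selection behavior the excerpt relies on.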
Contributor

Do we want to run the invariant checking after an incomplete E2E test run? If the E2E suite timed out in the middle or got interrupted during a manual run, the sentinel test may not have run yet, thus (with the current approach) skipping the invariant checking. Given the random ordering of tests, it's non-deterministic.

I think this can be argued either way, but it should be documented. I'm leaning towards always running.

If we want to run it also in case of a partial E2E run, then I think we can: instead of checking after the run whether the sentinel did run, we check before the run whether the sentinel is meant to run. This works because ReportAfterSuite gets called before the actual run with the result of a dry-run.

Member Author

> Do we want to run the invariant checking after an incomplete E2E test run? If the E2E suite timed out in the middle or got interrupted during a manual run, the sentinel test may not have run yet, thus (with the current approach) skipping the invariant checking. Given the random ordering of tests, it's non-deterministic.

I don't think it matters; a partial run is best-effort regardless. This is getting into YAGNI territory.

There's no reason to optimize for failing to run the complete suite.

It's not really different from any other test; if we time out or get interrupted then it is non-deterministic which tests actually ran before then.

Contributor

Even if a run was interrupted, checking the invariants might still provide some relevant information. But I don't care strongly about this. Let's just document the behavior for the sake of clarity.

Member

I fail to see a system where an incomplete job can be a good source of signal for anything.

Member Author

> Let's just document the behavior for the sake of clarity.

The behavior is the same as any other test (best effort, random scheduling, may fail if the run is terminated early or times out); I'm not sure where we should document this. In one of the KEP sections?

Note that in CI, no matter how you implement this, we always have the case that tests may not run ... we do not guarantee that the entire CI pod will not be terminated. The scheduling aspect is a relatively unimportant detail of "timeouts / early interrupts will lead to incomplete results", which is true even for non-e2e CI.


Something like this, for the failure message:
> Invariant failed for missing metric: <metric-name>
> If this failed on a PR, check your PR for bugs, if this failed on a periodic test please file a bug and assign the owners.
Contributor

Isn't "check your PR for bugs" the wrong message and in conflict with "tragedy of the commons"? It implies that it is the responsibility of the PR owner instead of the invariant owner to figure out why the invariant was violated. If we go down this route, it becomes the responsibility of everyone to debug the feature tested by the invariants.

Invariants are most likely to break because of some change, i.e. in a PR, unless we merge some changes where tests only run in CI or in additional environments.

Contributor

OTOH, the invariant might really have been broken by the PR.

How about this?

> If this failed, search for an existing GitHub issue. It might be a known problem
> that you can ignore.
>
> Otherwise, if this failed on a periodic job, please file a bug and assign the owners.
> If this failed on a PR, try to determine if it is something that could have been
> caused by your changes. If not or unsure, file an issue.

Member Author

Isn't "check your PR for bugs" the wrong message and in conflict with "tragedy of the commons"? It implies that it is the responsibility of the PR owner instead of the invariant owner to figure out why the invariant was violated. If we go down this route, it becomes the responsibility of everyone to debug the feature tested by the invariants.

We can rephrase it, but the first reaction to a failure on the PR should be to check if the PR caused the problem, NOT to go file a bug for the documented owners, that's not appropriate and not scalable.

Invariants are most likely to break because of some change, i.e. in a PR, unless we merge some changes where tests only run in CI or in additional environments.

They can start to flake in periodic testing due to some change that reduced reliability, if it fails in a periodic then we clearly should have the owners of the check follow up. That is unambiguous, versus failures on a pull request.

Member Author

Again, the "check your own PR for bugs" is the case for all tests we have currently.

But there will be a documented owner to escalate to. We should state explicitly that, after checking into the PR yourself, if you can't see how it would be related then you can escalate to the owners.

But it won't be useful to have documented owners if they are pinged for every failure, it will become noise no better than tragedy of the commons.

If you touch something in a PR and the invariant check goes red, it's not that likely that you can't relate it at all. The most likely cases are:

  1. it's clearly related to the PR (e.g. you changed validation and broke the validation check)
  2. it's flaking, which we'll find via periodics flaking.

For PRs, it is better in both cases that the PR author / reviewers / approvers look into the PR breaking tests first (even if it turns out to be a flake, there's not a good way to be sure; if it is flaking in periodics we will hopefully already address it by fixing or disabling the test relatively quickly).
We can add some language around getting help after attempting to identify the issue with your PR, will need to think a bit about how to phrase that.

Contributor

I agree that some due diligence by the PR author is needed.

I'm not sure whether you saw my comment with the "OTOH" (#5514 (comment)). I was suggesting an update of the wording there.

Contributor

> I'm going to tweak this a bit, the proposed update still sounds like we expect broken checks as a regular course of business, I really don't expect that and I don't want to send that message.

Because it appears in the KEP? The only time when users are going to see it is when it really has gone wrong, so I'm not worried about this.

> We can also finalize the actual log message in the implementation PR(s) ...

True, but then we need to go back to the KEP and update that, or merge the KEP with an explicit comment that the proposal is just an example and that new usages should follow the implementation. But that would diminish the value of the KEP. 🤷

Member Author

> Because it appears in the KEP? The only time when users are going to see it is when it really has gone wrong, so I'm not worried about this.

No, I mean we should never send the message that test failures should be assumed unrelated to your PR by default.

> True, but then we need to go back to the KEP and update that, or merge the KEP with an explicit comment that the proposal is just an example and that new usages should follow the implementation. But that would diminish the value of the KEP. 🤷

KEPs usually aren't used to finalize implementation strings; I could delete the example here /shrug. The point of the KEP is to agree in concept on what we're permitting and how we're going about it, not on exactly what strings will be in the code ...

Member Author

[I did update the wording here ...]

Member Author

My intention was to capture that we would require a failure message instructing how to handle the test; perhaps it would be better to spell out these requirements than to include a possible templated message. The exact template inputs etc. will likely best be covered in implementation review.

Member Author

I think I have already covered the requirements for that without the sample message.

@aojea (Member) commented Sep 10, 2025

LGTM

There are some open comments, but we have been discussing this for a long time and I think it describes the overall idea pretty well. Of course, when moving to execution we may find some unexpected issues that we need to reconsider, but I think it is important to get this in early in the cycle, to get feedback before the large features start to get merged, rather than trying to get the KEP perfect.

@richabanker (Contributor)

/lgtm
from instrumentation side

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 10, 2025
@richabanker (Contributor)

/hold

for other comments to be resolved

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 10, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 10, 2025
@BenTheElder (Member Author)

This is failing because we don't have the PRR approver metadata yet.

Will reach out when we sort out api-machinery; I posted in their Slack last week and reached out to a lead today.

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 10, 2025
@pohly (Contributor) left a comment

Looks good overall, some nits and suggestions.

@@ -0,0 +1,970 @@
<!--
**Note:** When your KEP is complete, all of these comment blocks should be removed.
Contributor

Nit: you could make the README.md easier to review if you dropped comments which are no longer needed.

@BenTheElder (Member Author) Sep 16, 2025

/shrug, this is an artifact of the KEP process; this KEP is not complete.
That these instructions are super verbose ... is another matter.

Contributor

You can keep those which are still relevant. But below you have filled out entire sections with "not applicable" while keeping the comments. Those comments could be removed because the section is complete.

As it stands now, the README.md is mostly obsolete comments with very little actual content.

>
> If this failed in a periodic CI, please file a bug and assign the owners.
>
> Owners for this metric: <github-handles>
Contributor

Suggested change
> Owners for this metric: <github-handles>
> Owners for this metric: <github-handles and/or SIG>

Member Author

I think we really need a list of specific owners of the feature; assigning SIGs to resolve bugs does not work for most SIGs. Maybe a few SIGs are extremely well staffed, but in general we need to require that a list of multiple owners is responsible for a feature.

Member Author

cc @aojea

Contributor

I'm worried that we'll end up with situations where individual owners have moved on. Whom should a problem be escalated to when that happens?

Listing the SIG would also be useful for /sig ... in the issue that we ask contributors to file.

Let's make that <github-handles and SIG(s)>?

Member Author

That makes sense; I just think they need to not be interchangeable and that a list of owners should be required. But also documenting the SIG makes sense.

Especially when this is first introduced for a feature there should be some specific owners with context that can help ensure there are not issues. Similar to the underlying metrics (which currently list a single owner and SIG).

I also think we should consider, as we approve these, requiring some minimum number of owners, but I don't know what to set it to, and I do want to make sure we start testing out the initial invariant early in the cycle; 1.35 will be a short cycle and we also may risk holding up declarative validation if we wind up having to revert this test.

I have pretty high confidence in the current implementation but ... unknown unknowns until we actually start soaking it.

Contributor

So do you want to add the "and SIG" or "and SIG(s)"? If not, then I can LGTM.

Contributor

Never mind, it's not ready for other reasons (PRR), as you noted in #5514 (comment).

Member Author

I added the SIG (though we may still revisit the exact formatting/language), and I took a stab at PRR by way of just marking it direct to stable ...

Member Author

Last time I did one of these "non-production" KEPs the process was a bit different ...

@BenTheElder (Member Author)

/cc @jpbetz

@k8s-ci-robot k8s-ci-robot requested a review from jpbetz September 16, 2025 22:49
These tests can be labeled appropriately, allowing the existing test selection
mechanisms to select / skip them.

We can then use `ginkgo.ReportAfterSuite` to inspect which, if any, of these
sentinel invariant tests ran, and run the *actual* test logic, reporting pass/fail.


ReportAfterSuite seems to be used mostly for custom reports, so it's not designed to run more tests.

In OpenShift we use two test annotations, [Early] and [Late], and split tests in the suite, running them in the necessary order. Has this alternative been considered?

Member Author

> ReportAfterSuite seems to be used mostly for custom reports, so it's not designed to run more tests.

Yeah, we're not running actual tests here though; we're only fetching metrics (which we already have an option to do after the test suite) and validating some of them. If not for the metric fetch part, it fits as a custom report. The metrics fetch part is cheap and quick anyhow so 🤷‍♂️

We explicitly do not wish to permit anything more expensive than this, or any mutable actions, or any intra-test ordering, only checking cluster status after-suite and reporting if it's invalid.

IMHO as a general rule, tests that need to be ordered are a huge anti-pattern especially in shared cluster e2e environment, and between the SIG Testing leads we've agreed to not permit this.

We actually considered doing this entirely outside of ginkgo just as part of TestMain() and having an env opt in/out, but leveraging the test selection logic to decide if it should run simplifies getting it enabled in the right jobs.

> In OpenShift we use two test annotations, [Early] and [Late], and split tests in the suite, running them in the necessary order. Has this alternative been considered?

This is an interesting option, but it would be more involved for our case; since we are using ginkgo's stock runner / scheduling and otherwise have no reason to rewrite this, we plan to keep randomized order for all tests other than this special invariant check.

I think it would make sense if we planned to have more tests with special ordering, but we don't.
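
As a point of comparison, here is a rough sketch of the TestMain() alternative mentioned above, gated by an environment variable. The variable name `E2E_CHECK_INVARIANTS` and the `checkMetricInvariants` helper are made up for this example; this approach was considered but not adopted.

```go
package e2e

import (
	"fmt"
	"os"
	"testing"
)

// Hypothetical opt-in variable; never implemented, since the ginkgo-based
// sentinel approach was chosen instead.
const invariantEnv = "E2E_CHECK_INVARIANTS"

func TestMain(m *testing.M) {
	code := m.Run() // run the whole e2e suite first

	// Opt in/out via an environment variable rather than ginkgo label selection.
	if os.Getenv(invariantEnv) == "true" {
		if err := checkMetricInvariants(); err != nil {
			fmt.Fprintf(os.Stderr, "invariant check failed: %v\n", err)
			if code == 0 {
				code = 1
			}
		}
	}
	os.Exit(code)
}

// checkMetricInvariants stands in for the real metric fetch and validation; elided here.
func checkMetricInvariants() error { return nil }
```

The trade-off, as noted above, is that a separate environment variable has to be wired into exactly the right jobs, whereas the sentinel approach reuses the suite's existing test selection logic.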

Member

I really prefer to avoid customizing ginkgo, since that will add more surface to cover and maintain, and we ran into a lot of trouble because of that when we had to migrate from v1 to v2 ... this is pretty simple and decoupled.


> The metrics fetch part is cheap and quick anyhow so 🤷‍♂️

I wonder if there is a way to limit the code to stay cheap and quick?

> we plan to keep randomized order for all tests other than this special invariant check

Sure, the test list can be additionally shuffled to ensure tests don't depend on a special order. I acknowledge it may be using advanced ginkgo features, but it doesn't seem like a significant ginkgo modification to me.

In any case, this idea is worth adding to the alternatives list imo.

@aojea (Member) commented Sep 18, 2025

/lgtm
/approve

I think we have an overall agreement on how to proceed, and the proposed implementation is encapsulated enough that we can always revisit without causing many problems, but at some point we need to execute and start finding the unknown unknowns ... 🤞

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 18, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea, BenTheElder
Once this PR has been reviewed and has the lgtm label, please assign wojtek-t for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
