Conversation

BenTheElder (Member)

  • One-line PR description: Introduce invariant testing KEP.
  • Other comments: This process has a truly annoying amount of redundancy and boilerplate. We really need to think about how we can streamline things a bit. I spent far more time on this KEP than on finding an agreeable solution to the problem and implementing it.

/cc @pohly @aojea @richabanker

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/testing Categorizes an issue or PR as relevant to SIG Testing. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 5, 2025
@BenTheElder BenTheElder mentioned this pull request Sep 5, 2025
@BenTheElder BenTheElder force-pushed the invariant branch 2 times, most recently from 5d434e5 to 7d568c8 on September 5, 2025 00:47
# worked on.
latest-milestone: "v1.35"

# The milestone at which this feature was, or is targeted to be, at each stage.
Member Author

Will update this pending the above discussion about the alpha/beta/stable lifecycle not really making sense for this KEP: https://github.com/kubernetes/enhancements/pull/5514/files#r2323831978

- sig-api-machinery
status: provisional
creation-date: 2025-09-04
reviewers:
Member Author

maybe @jpbetz ?

richa per #5468 (comment)

patrick and antonio as fellow SIG Testing TLs

@BenTheElder (Member Author)

There is a draft of the implementation in kubernetes/kubernetes#133394, though I intend to capture all necessary details here.

Mostly here we want to formalize what sort of "run after all other tests" is permitted and how we're approaching that.

@pohly (Contributor) commented Sep 5, 2025

/assign

mechanisms to select / skip them.

We can then use `ginkgo.ReportAfterSuite` to inspect which, if any, of these
sentinel invariant tests ran, and run the *actual* test logic, reporting pass/fail.
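
For illustration, here is a minimal Go sketch of the sentinel plus `ginkgo.ReportAfterSuite` pattern described in this excerpt. It is not the implementation from kubernetes/kubernetes#133394; the test name, the `Invariant` label, and the `checkMetricInvariants` helper are hypothetical placeholders.

```go
package invariants

import (
	"testing"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/ginkgo/v2/types"
)

// Hypothetical name; the real test name and labels are defined in the implementation PR.
const sentinelName = "[sig-testing] metrics invariants sentinel"

// The sentinel spec does nothing itself; it exists so the normal label/skip
// selection machinery decides whether invariant checking applies to this run.
var _ = ginkgo.It(sentinelName, ginkgo.Label("Invariant"), func() {})

// After the suite, inspect the report: if the sentinel ran, perform the real
// invariant check and fail the suite when it is violated.
var _ = ginkgo.ReportAfterSuite("metrics invariant check", func(report ginkgo.Report) {
	sentinelRan := false
	for _, spec := range report.SpecReports {
		if spec.FullText() == sentinelName && spec.State == types.SpecStatePassed {
			sentinelRan = true
			break
		}
	}
	if !sentinelRan {
		return // sentinel skipped or filtered out: no invariant checking this run
	}
	if err := checkMetricInvariants(); err != nil {
		ginkgo.Fail("invariant check failed: " + err.Error())
	}
})

// checkMetricInvariants stands in for the real logic (scrape metrics from the
// cluster, validate a small set of them); elided in this sketch.
func checkMetricInvariants() error { return nil }

func TestInvariants(t *testing.T) {
	ginkgo.RunSpecs(t, "invariant sentinel sketch")
}
```

With this shape, the existing selection mechanisms (ginkgo label filters or skip patterns) that exclude the sentinel spec also disable the after-suite check, which is the selection behavior the excerpt relies on.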
Contributor

Do we want to run the invariant checking after an incomplete E2E test run? If the E2E suite timed out in the middle or got interrupted during a manual run, the sentinel test may not have run yet, thus (with the current approach) skipping the invariant checking. Given the random ordering of tests, it's non-deterministic.

I think this can be argued either way, but it should be documented. I'm leaning towards always running.

If we want to run it also in case of a partial E2E run, then I think we can: instead of checking after the run whether the sentinel did run, we check before the run whether the sentinel is meant to run. This works because ReportAfterSuite gets called before the actual run with the result of a dry-run.

Member Author

> Do we want to run the invariant checking after an incomplete E2E test run? If the E2E suite timed out in the middle or got interrupted during a manual run, the sentinel test may not have run yet, thus (with the current approach) skipping the invariant checking. Given the random ordering of tests, it's non-deterministic.

I don't think it matters; a partial run is best-effort regardless. This is getting into YAGNI territory.

There's no reason to optimize for failing to run the complete suite.

It's not really different from any other test; if we time out or get interrupted then it is non-deterministic which tests actually ran before then.

Contributor

Even if a run was interrupted, checking the invariants might still provide some relevant information. But I don't care strongly about this. Let's just document the behavior for the sake of clarity.

Member

I fail to see a system where an incomplete job can be a good source of signal for anything.

Member Author

> Let's just document the behavior for the sake of clarity.

The behavior is the same as any other test (best effort, random scheduling, may fail if the run is terminated early or times out); I'm not sure where we should document this. In one of the KEP sections?

Note that in CI, no matter how you implement this, we always have the case that tests may not run ... we do not guarantee that the entire CI pod will not be terminated. The scheduling aspect is a relatively unimportant detail of "timeouts / early interrupts will lead to incomplete results", which is true even for non-e2e CI.


Something like this, for the failure message:
> Invariant failed for missing metric: <metric-name>
> If this failed on a PR, check your PR for bugs, if this failed on a periodic test please file a bug and assign the owners.
Contributor

Isn't "check your PR for bugs" the wrong message and in conflict with "tragedy of the commons"? It implies that it is the responsibility of the PR owner instead of the invariant owner to figure out why the invariant was violated. If we go down this route, it becomes the responsibility of everyone to debug the feature tested by the invariants.

Invariants are most likely to break because of some change, i.e. in a PR, unless we merge some changes where tests only run in CI or in additional environments.

Contributor

OTOH, the invariant might really have been broken by the PR.

How about this?

> If this failed, search for an existing GitHub issue. It might be a known problem
> that you can ignore.
>
> Otherwise, if this failed on a periodic job, please file a bug and assign the owners.
> If this failed on a PR, try to determine if it is something that could have been
> caused by your changes. If not or unsure, file an issue.

Member Author

Isn't "check your PR for bugs" the wrong message and in conflict with "tragedy of the commons"? It implies that it is the responsibility of the PR owner instead of the invariant owner to figure out why the invariant was violated. If we go down this route, it becomes the responsibility of everyone to debug the feature tested by the invariants.

We can rephrase it, but the first reaction to a failure on the PR should be to check if the PR caused the problem, NOT to go file a bug for the documented owners, that's not appropriate and not scalable.

Invariants are most likely to break because of some change, i.e. in a PR, unless we merge some changes where tests only run in CI or in additional environments.

They can start to flake in periodic testing due to some change that reduced reliability, if it fails in a periodic then we clearly should have the owners of the check follow up. That is unambiguous, versus failures on a pull request.

Member Author

Again, the "check your own PR for bugs" is the case for all tests we have currently.

But there will be a documented owner to escalate to. We should state explicitly that, after checking into the PR yourself, if you can't see how it would be related then you can escalate to the owners.

But it won't be useful to have documented owners if they are pinged for every failure, it will become noise no better than tragedy of the commons.

If you touch something in a PR and the invariant check goes red, it's not that likely that you can't relate it at all. The most likely cases are:

  1. it's clearly related to the PR (e.g. you changed validation and broke the validation check)
  2. it's flaking, which we'll find via periodics flaking.

For PRs, it is better in both cases that the PR author / reviewers / approvers look into the PR breaking tests first (even if it turns out to be a flake, there's not a good way to be sure; if it is flaking in periodics we will hopefully already address it by fixing or disabling the test relatively quickly).
We can add some language around getting help after attempting to identify the issue with your PR, will need to think a bit about how to phrase that.

Contributor

I agree that some due diligence by the PR author is needed.

I'm not sure whether you saw my comment with the "OTOH" (#5514 (comment)). I was suggesting an update of the wording there.

Contributor

> I'm going to tweak this a bit, the proposed update still sounds like we expect broken checks as a regular course of business, I really don't expect that and I don't want to send that message.

Because it appears in the KEP? The only time when users are going to see it is when it really has gone wrong, so I'm not worried about this.

> We can also finalize the actual log message in the implementation PR(s) ...

True, but then we need to go back to the KEP and update that, or merge the KEP with an explicit comment that the proposal is just an example and that new usages should follow the implementation. But that would diminish the value of the KEP. 🤷

Member Author

> Because it appears in the KEP? The only time when users are going to see it is when it really has gone wrong, so I'm not worried about this.

No, I mean we should never send the message that test failures should be assumed unrelated to your PR by default.

> True, but then we need to go back to the KEP and update that, or merge the KEP with an explicit comment that the proposal is just an example and that new usages should follow the implementation. But that would diminish the value of the KEP. 🤷

KEPs usually aren't used to finalize implementation strings; I could delete the example here /shrug. The point of the KEP is to agree in concept on what we're permitting and how we're going about it, not on exactly what strings will be in the code ...

Member Author

[I did update the wording here ...]

Member Author

My intention was to capture that we would require a failure message instructing how to handle the test; perhaps it would be better to spell out these requirements than to include a possible templated message. The exact template inputs etc. will likely best be covered in implementation review.

Member Author

I think I have already covered the requirements for that without the sample message.

@aojea (Member) commented Sep 10, 2025

LGTM

There are some open comments, but we have been discussing this for a long time and I think it describes the overall idea pretty well. Of course, when moving to execution we may find some unexpected issues that we need to reconsider, but I think it is important to get this in early in the cycle, to get feedback before the large features start to get merged, rather than trying to get the KEP perfect.

@richabanker (Contributor)

/lgtm
from instrumentation side

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 10, 2025
@richabanker (Contributor)

/hold

for other comments to be resolved

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 10, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 10, 2025
@BenTheElder (Member Author)

This is failing because we don't have the PRR approver metadata yet.

Will reach out when we sort out api-machinery; I posted in their Slack last week and reached out to a lead today.

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 10, 2025
@pohly (Contributor) left a comment

Looks good overall, some nits and suggestions.

@@ -0,0 +1,970 @@
<!--
**Note:** When your KEP is complete, all of these comment blocks should be removed.
Contributor

Nit: you could make the README.md easier to review if you dropped comments which are no longer needed.

@BenTheElder (Member Author) Sep 16, 2025

/shrug, this is an artifact of the KEP process; this KEP is not complete.
That these instructions are super verbose ... is another matter.

Contributor

You can keep those which are still relevant. But below you have filled out entire sections with "not applicable" while keeping the comments. Those comments could be removed because the section is complete.

As it stands now, the README.md is mostly obsolete comments with very little actual content.

>
> If this failed in a periodic CI, please file a bug and assign the owners.
>
> Owners for this metric: <github-handles>
Contributor

Suggested change
> Owners for this metric: <github-handles>
> Owners for this metric: <github-handles and/or SIG>

Member Author

I think we really need a list of specific owners of the feature; assigning SIGs to resolve bugs does not work for most SIGs. Maybe a few SIGs are extremely well staffed, but in general we need to require that a list of multiple owners is responsible for a feature.

Member Author

cc @aojea

Contributor

I'm worried that we'll end up with situations where individual owners have moved on. Whom should a problem be escalated to when that happens?

Listing the SIG would also be useful for /sig ... in the issue that we ask contributors to file.

Let's make that <github-handles and SIG(s)>?

Member Author

That makes sense; I just think they need to not be interchangeable and that a list of owners should be required. But also documenting the SIG makes sense.

Especially when this is first introduced for a feature there should be some specific owners with context that can help ensure there are not issues. Similar to the underlying metrics (which currently list a single owner and SIG).

I also think we should consider, as we approve these, requiring some minimum number of owners, but I don't know what to set it to, and I do want to make sure we start testing out the initial invariant early in the cycle; 1.35 will be a short cycle and we also may risk holding up declarative validation if we wind up having to revert this test.

I have pretty high confidence in the current implementation but ... unknown unknowns until we actually start soaking it.

Contributor

So do you want to add the "and SIG" or "and SIG(s)"? If not, then I can LGTM.

Contributor

Never mind, it's not ready for other reasons (PRR), as you noted in #5514 (comment).

Member Author

I added the SIG (though we may still revisit the exact formatting/language), and I took a stab at PRR by way of just marking it direct to stable ...

Member Author

Last time I did one of these "non-production" KEPs the process was a bit different ...

@BenTheElder (Member Author)

/cc @jpbetz

@k8s-ci-robot k8s-ci-robot requested a review from jpbetz September 16, 2025 22:49
These tests can be labeled appropriately, allowing the existing test selection
mechanisms to select / skip them.

We can then use `ginkgo.ReportAfterSuite` to inspect which, if any, of these
sentinel invariant tests ran, and run the *actual* test logic, reporting pass/fail.


ReportAfterSuite seems to be used mostly for custom reports, so it's not designed to run more tests.

In OpenShift we use two test annotations, [Early] and [Late], and split tests in the suite, running them in the necessary order. Has this alternative been considered?

Member Author

> ReportAfterSuite seems to be used mostly for custom reports, so it's not designed to run more tests.

Yeah, we're not running actual tests here though; we're only fetching metrics (which we already have an option to do after the test suite) and validating some of them. If not for the metric fetch part, it fits as a custom report. The metrics fetch part is cheap and quick anyhow so 🤷‍♂️

We explicitly do not wish to permit anything more expensive than this, or any mutable actions, or any intra-test ordering, only checking cluster status after-suite and reporting if it's invalid.

IMHO as a general rule, tests that need to be ordered are a huge anti-pattern especially in shared cluster e2e environment, and between the SIG Testing leads we've agreed to not permit this.

We actually considered doing this entirely outside of ginkgo just as part of TestMain() and having an env opt in/out, but leveraging the test selection logic to decide if it should run simplifies getting it enabled in the right jobs.

> In OpenShift we use two test annotations, [Early] and [Late], and split tests in the suite, running them in the necessary order. Has this alternative been considered?

This is an interesting option, but it would be more involved for our case; since we are using ginkgo's stock runner / scheduling and otherwise have no reason to rewrite this, we plan to keep randomized order for all tests other than this special invariant check.

I think it would make sense if we planned to have more tests with special ordering, but we don't.
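
As a point of comparison, here is a rough sketch of the TestMain() alternative mentioned above, gated by an environment variable. The variable name `E2E_CHECK_INVARIANTS` and the `checkMetricInvariants` helper are made up for this example; this approach was considered but not adopted.

```go
package e2e

import (
	"fmt"
	"os"
	"testing"
)

// Hypothetical opt-in variable; never implemented, since the ginkgo-based
// sentinel approach was chosen instead.
const invariantEnv = "E2E_CHECK_INVARIANTS"

func TestMain(m *testing.M) {
	code := m.Run() // run the whole e2e suite first

	// Opt in/out via an environment variable rather than ginkgo label selection.
	if os.Getenv(invariantEnv) == "true" {
		if err := checkMetricInvariants(); err != nil {
			fmt.Fprintf(os.Stderr, "invariant check failed: %v\n", err)
			if code == 0 {
				code = 1
			}
		}
	}
	os.Exit(code)
}

// checkMetricInvariants stands in for the real metric fetch and validation; elided here.
func checkMetricInvariants() error { return nil }
```

The trade-off, as noted above, is that a separate environment variable has to be wired into exactly the right jobs, whereas the sentinel approach reuses the suite's existing test selection logic.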

Member

I really prefer to avoid customizing ginkgo, since that will add more surface to cover and maintain, and we ran into a lot of trouble because of that when we had to migrate from v1 to v2 ... this is pretty simple and decoupled.


> The metrics fetch part is cheap and quick anyhow so 🤷‍♂️

I wonder if there is a way to limit the code to stay cheap and quick?

> we plan to keep randomized order for all tests other than this special invariant check

Sure, the test list can be additionally shuffled to ensure tests don't depend on a special order. I acknowledge it may be using advanced ginkgo features, but it doesn't seem like a significant ginkgo modification to me.

In any case, this idea is worth adding to the alternatives list imo.

@aojea (Member) commented Sep 18, 2025

/lgtm
/approve

I think we have an overall agreement on how to proceed, and the proposed implementation is encapsulated enough that we can always revisit without causing many problems, but at some point we need to execute and start finding the unknown unknowns ... 🤞

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 18, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea, BenTheElder
Once this PR has been reviewed and has the lgtm label, please assign wojtek-t for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
