KEP-5468: Invariant Testing #5514
Conversation
# worked on.
latest-milestone: "v1.35"

# The milestone at which this feature was, or is targeted to be, at each stage.
Will update this pending the above discussion about the alpha/beta/stable lifecycle not really making sense for this KEP: https://github.com/kubernetes/enhancements/pull/5514/files#r2323831978
- sig-api-machinery
status: provisional
creation-date: 2025-09-04
reviewers:
There is a draft of the implementation in kubernetes/kubernetes#133394, though I intend to capture all necessary details here. Mostly, what we want to formalize here is what sort of "run after all other tests" is permitted and how we're approaching that.
/assign
mechanisms to select / skip them.

We can then use `ginkgo.ReportAfterSuite` to inspect which, if any, of these
sentinel invariant tests ran, and run the *actual* test logic, reporting pass/fail.
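For illustration, a minimal sketch of the sentinel-plus-`ReportAfterSuite` pattern described above. The label name, spec text, and `checkInvariants` helper are hypothetical; the actual draft implementation lives in kubernetes/kubernetes#133394.

```go
// invariants_test.go — a sketch only, not the actual implementation.
package invariants_test

import (
	"testing"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/ginkgo/v2/types"
	"github.com/onsi/gomega"
)

func TestInvariants(t *testing.T) {
	gomega.RegisterFailHandler(ginkgo.Fail)
	ginkgo.RunSpecs(t, "invariant sentinel sketch")
}

// The sentinel spec is a no-op: it exists only so that the usual label-based
// test selection decides whether invariant checking is enabled for this run.
var _ = ginkgo.It("[Invariant] sentinel", ginkgo.Label("Invariant"), func() {})

// After the suite, inspect the report; if the sentinel was selected and ran,
// perform the real (cheap, read-only) invariant check and report pass/fail.
var _ = ginkgo.ReportAfterSuite("invariant checks", func(report ginkgo.Report) {
	sentinelRan := false
	for _, spec := range report.SpecReports {
		for _, label := range spec.Labels() {
			if label == "Invariant" && spec.State.Is(types.SpecStatePassed) {
				sentinelRan = true
			}
		}
	}
	if !sentinelRan {
		return // sentinel was skipped or deselected; nothing to check
	}
	if err := checkInvariants(); err != nil { // hypothetical helper
		ginkgo.Fail("invariant violated: " + err.Error())
	}
})

// checkInvariants stands in for the real read-only validation logic.
func checkInvariants() error { return nil }
```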
Do we want to run the invariant checking after an incomplete E2E test run? If the E2E suite timed out in the middle or got interrupted during a manual run, the sentinel test may not have run yet, thus (with the current approach) skipping the invariant checking. Given the random ordering of tests, it's non-deterministic.
I think this can be argued either way, but it should be documented. I'm leaning towards always running.
If we want to run it also in the case of a partial E2E run, then I think we can: instead of checking after the run whether the sentinel did run, we check before the run whether the sentinel is meant to run. This works because `ReportAfterSuite` gets called before the actual run with the result of a dry-run.
> Do we want to run the invariant checking after an incomplete E2E test run? If the E2E suite timed out in the middle or got interrupted during a manual run, the sentinel test may not have run yet, thus (with the current approach) skipping the invariant checking. Given the random ordering of tests, it's non-deterministic.

I don't think it matters; a partial run is best-effort regardless. This is into YAGNI territory.
There's no reason to optimize for failing to run the complete suite.
It's not really different from any other test: if we time out or get interrupted, then it is non-deterministic which tests actually ran before that point.
Even if a run was interrupted, checking the invariants might still provide some relevant information. But I don't care strongly about this. Let's just document the behavior for the sake of clarity.
I fail to see a system where an incomplete job can be a good source of signal for anything.
> Let's just document the behavior for the sake of clarity.

The behavior is the same as for any other test (best effort, random scheduling, may fail if the run is terminated early or times out); I'm not sure where we should document this. In one of the KEP sections?
Note that in CI, no matter how you implement this, we always have the case that tests may not run ... we do not guarantee that the entire CI pod will not be terminated. The scheduling aspect is a relatively unimportant detail of "timeouts / early interrupts will lead to incomplete results", which is true even for non-e2e CI.
Something like this, for the failure message:
> Invariant failed for missing metric: <metric-name>
> If this failed on a PR, check your PR for bugs, if this failed on a periodic test please file a bug and assign the owners.
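For illustration only, a hypothetical sketch of how such a templated message might be assembled; the helper name, parameters, and exact wording here are assumptions, and the final string is expected to be settled during implementation review.

```go
package invariants

import (
	"fmt"
	"strings"
)

// invariantFailureMessage builds an actionable failure message for a missing
// metric. Hypothetical helper: the real wording and inputs are an
// implementation detail, not fixed by the KEP.
func invariantFailureMessage(metricName, sig string, owners []string) string {
	return fmt.Sprintf(
		"Invariant failed for missing metric: %s\n"+
			"If this failed on a PR, first check whether your change could have caused it.\n"+
			"If this failed on a periodic test, please file a bug and assign the owners.\n"+
			"Owners for this metric: %s (%s)",
		metricName, strings.Join(owners, ", "), sig,
	)
}
```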
Isn't "check your PR for bugs" the wrong message and in conflict with "tragedy of the commons"? It implies that it is the responsibility of the PR owner instead of the invariant owner to figure out why the invariant was violated. If we go down this route, it becomes the responsibility of everyone to debug the feature tested by the invariants.
Invariants are most likely to break because of some change, i.e. in a PR, unless we merge some changes where tests only run in CI or in additional environments.
OTOH, the invariant might really have been broken by the PR.
How about this?

> If this failed, search for an existing GitHub issue. It might be a known problem
> that you can ignore.
> Otherwise, if this failed on a periodic job, please file a bug and assign the owners.
> If this failed on a PR, try to determine if it is something that could have been
> caused by your changes. If not or unsure, file an issue.
> Isn't "check your PR for bugs" the wrong message and in conflict with "tragedy of the commons"? It implies that it is the responsibility of the PR owner instead of the invariant owner to figure out why the invariant was violated. If we go down this route, it becomes the responsibility of everyone to debug the feature tested by the invariants.

We can rephrase it, but the first reaction to a failure on the PR should be to check if the PR caused the problem, NOT to go file a bug for the documented owners; that's not appropriate and not scalable.

> Invariants are most likely to break because of some change, i.e. in a PR, unless we merge some changes where tests only run in CI or in additional environments.

They can start to flake in periodic testing due to some change that reduced reliability; if it fails in a periodic, then we clearly should have the owners of the check follow up. That is unambiguous, versus failures on a pull request.
Again, "check your own PR for bugs" is the case for all tests we currently have.
But there will be a documented owner to escalate to. We should state explicitly that, after looking into the PR yourself, if you can't see how it would be related, then you can escalate to the owners.
But it won't be useful to have documented owners if they are pinged for every failure; it will become noise no better than the tragedy of the commons.
If you touch something in a PR and the invariant check goes red, it's not that likely that you can't relate it at all. The most likely cases are:
- it's clearly related to the PR (e.g. you changed validation and broke the validation check)
- it's flaking, which we'll find via periodics flaking.
For PRs, it is better in both cases that the PR author / reviewers / approvers look into the PR breaking tests first (even if it turns out to be a flake, there's no good way to be sure; if it is flaking in periodics, we will hopefully address it relatively quickly by fixing or disabling the test).
We can add some language around getting help after attempting to identify the issue with your PR; I will need to think a bit about how to phrase that.
I agree that some due diligence by the PR author is needed.
I'm not sure whether you saw my comment with the "OTOH" (#5514 (comment)). I was suggesting an update of the wording there.
I'm going to tweak this a bit; the proposed update still sounds like we expect broken checks as a regular course of business. I really don't expect that, and I don't want to send that message.
Because it appears in the KEP? The only time when users are going to see it is when it really has gone wrong, so I'm not worried about this.
We can also finalize the actual log message in the implementation PR(s) ...
True, but then we need to go back to the KEP and update that, or merge the KEP with an explicit comment that the proposal is just an example and that new usages should follow the implementation. But that would diminish the value of the KEP. 🤷
> Because it appears in the KEP? The only time when users are going to see it is when it really has gone wrong, so I'm not worried about this.

No, I mean we should never send the message that a failing test should be assumed to be unrelated to their PR by default.

> True, but then we need to go back to the KEP and update that, or merge the KEP with an explicit comment that the proposal is just an example and that new usages should follow the implementation. But that would diminish the value of the KEP. 🤷

KEPs usually aren't used to finalize implementation strings; I could delete the example here, /shrug. The point of the KEP is to agree in concept on what we're permitting and how we're going about it, not on exactly what strings will be in the code ...
[I did update the wording here ...]
My intention was to capture that we would require a failure message instructing how to handle the test; perhaps it would be better to spell out these requirements than to include a possible templated message. The exact template inputs etc. will likely best be covered in implementation review.
I think I have already covered the requirements for that without the sample message.
LGTM. There are some open comments, but we have been discussing this for a long time and I think it describes the overall idea pretty well. Of course, when moving to execution we may find some unexpected issues that we need to reconsider, but I think it is important to get this in early in the cycle to get feedback before the large features start to get merged, rather than aiming for perfection in the KEP.
/lgtm
/hold for other comments to be resolved
This is failing because we don't have the PRR approver metadata yet. Will reach out when we sort out api-machinery; I posted in their Slack last week and reached out to a lead today.
Looks good overall, some nits and suggestions.
> @@ -0,0 +1,970 @@
> <!--
> **Note:** When your KEP is complete, all of these comment blocks should be removed.
Nit: you could make the README.md easier to review if you dropped comments which are no longer needed.
/shrug, this is an artifact of the KEP process; this KEP is not complete.
That these instructions are super verbose ... is another matter.
You can keep those which are still relevant. But below you have filled out entire sections with "not applicable" while keeping the comments. Those comments could be removed because the section is complete.
As it stands now, the README.md is mostly obsolete comments with very little actual content.
>
> If this failed in a periodic CI, please file a bug and assign the owners.
>
> Owners for this metric: <github-handles>
Suggested change:
> Owners for this metric: <github-handles and/or SIG>
I think we really need a list of specific owners of the feature; assigning SIGs to resolve bugs does not work for most SIGs. Maybe a few SIGs are extremely well staffed, but in general we need to require that a list of multiple owners is responsible for a feature.
cc @aojea
I'm worried that we'll end up with situations where individual owners have moved on. Whom should a problem be escalated to when that happens?
Listing the SIG would also be useful for `/sig ...` in the issue that we ask contributors to file.
Let's make that `<github-handles and SIG(s)>`?
That makes sense; I just think they need to not be interchangeable and that a list of owners should be required. But documenting the SIG also makes sense.
Especially when this is first introduced for a feature there should be some specific owners with context that can help ensure there are not issues. Similar to the underlying metrics (which currently list a single owner and SIG).
I also think we should consider, as we approve these, requiring some minimum number of owners, but I don't know what to set it to, and I do want to make sure we start testing out the initial invariant early in the cycle. 1.35 will be a short cycle, and we may also risk holding up declarative validation if we wind up having to revert this test.
I have pretty high confidence in the current implementation but ... unknown unknowns until we actually start soaking it.
So do you want to add the "and SIG" or "and SIG(s)"? If not, then I can LGTM.
Never mind, it's not ready for other reasons (PRR), as you noted in #5514 (comment).
I added the SIG (though we may still revisit the exact formatting/language), and I took a stab at PRR by way of just marking it direct to stable ...
Last time I did one of these "non-production" KEPs the process was a bit different ...
/cc @jpbetz
These tests can be labeled appropriately, allowing the existing test selection
mechanisms to select / skip them.

We can then use `ginkgo.ReportAfterSuite` to inspect which, if any, of these
`ReportAfterSuite` seems to be used mostly for custom reports, so it's not designed to run more tests.
In OpenShift we use two test annotations, `[Early]` and `[Late]`, and split the tests in the suite, running them in the necessary order. Has this alternative been considered?
> ReportAfterSuite seems to be used mostly for custom reports, so it's not designed to run more tests.

Yeah, we're not running actual tests here though; we're only fetching metrics (which we already have an option to do after the test suite) and validating some of them. If not for the metric fetch part, it would fit as a custom report. The metrics fetch part is cheap and quick anyhow, so 🤷♂️
We explicitly do not wish to permit anything more expensive than this, or any mutating actions, or any intra-test ordering: only checking cluster status after-suite and reporting if it's invalid.
IMHO, as a general rule, tests that need to be ordered are a huge anti-pattern, especially in a shared-cluster e2e environment, and between the SIG Testing leads we've agreed not to permit this.
We actually considered doing this entirely outside of ginkgo, just as part of TestMain(), with an env var to opt in/out, but leveraging the test selection logic to decide whether it should run simplifies getting it enabled in the right jobs.

> In OpenShift we use two test annotations: [Early] and [Late] and split tests in the suite, running them in necessary order. Have this alternative been considered?

This is an interesting option, but it would be more involved for our case: since we are using ginkgo's stock runner / scheduling and otherwise have no reason to rewrite this, we plan to keep randomized order for all tests other than this special invariant check.
I think it would make sense if we planned to have more tests with special ordering, but we don't.
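To make concrete how cheap and limited the after-suite check is meant to be, here is a rough sketch of a read-only metric-presence check. The helper name and the use of `prometheus/common/expfmt` are assumptions for illustration; the e2e framework already has its own metrics-fetching support, which the real implementation would likely use.

```go
package invariants

import (
	"fmt"
	"strings"

	"github.com/prometheus/common/expfmt"
)

// hasMetricFamily reports whether the named metric family appears in a
// Prometheus text-format dump (e.g. a scrape of a component's /metrics
// endpoint). Illustrative only; the real check may look quite different.
func hasMetricFamily(metricsText, name string) (bool, error) {
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(strings.NewReader(metricsText))
	if err != nil {
		return false, fmt.Errorf("parsing metrics: %w", err)
	}
	_, ok := families[name]
	return ok, nil
}
```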
I really prefer to avoid customizing ginkgo, since that will add more surface to cover and maintain, and we ran into a lot of trouble because of that when we had to migrate from v1 to v2 ... this is pretty simple and decoupled.
> The metrics fetch part is cheap and quick anyhow so 🤷♂️

I wonder if there is a way to limit the code so that it stays cheap and quick?

> we plan to keep randomized order for all tests other than this special invariant check

Sure, the test list can be additionally shuffled to ensure tests don't depend on a special order. I acknowledge it may be using advanced ginkgo features, but it doesn't seem like a significant ginkgo modification to me.
In any case, this idea is worth adding to the alternatives list, IMO.
/lgtm I think we have overall agreement on how to proceed, and the proposed implementation is encapsulated enough that we can always revisit without causing many problems, but at some point we need to execute and start finding the unknown unknowns ... 🤞
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea, BenTheElder
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing ...
/cc @pohly @aojea @richabanker