Conversation

tiffanny29631
Contributor

@tiffanny29631 tiffanny29631 commented Jul 11, 2025

Remove the readiness probe and health check from otel-agent containers in reconciler, reconciler-manager, and resourcegroup controller to align container behavior. The healthcheck component can fail to bind to the port or respond under CPU throttling, causing unnecessary pod unready states even when the container is running. The otel-collector health check in the config-management-monitoring namespace is retained since it is tied to a Service.
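
For reference, the removed configuration consists of two pieces: a readinessProbe on the otel-agent container and the health_check extension in its collector config. A rough sketch of both, with illustrative values reconstructed from the discussion (port 13133 is mentioned later in the thread) rather than copied from the manifests:

```yaml
# Removed from the otel-agent container spec in the reconciler,
# reconciler-manager, and resourcegroup-controller Deployments (illustrative):
readinessProbe:
  httpGet:
    path: /
    port: 13133   # endpoint served by the collector's health_check extension

# Removed from the otel-agent collector configuration (illustrative):
extensions:
  health_check:
service:
  extensions: [health_check]
```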

@google-oss-prow google-oss-prow bot requested review from Camila-B and sdowell July 11, 2025 23:38

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @tiffanny29631, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the robustness of the otel-agent deployment by adding Kubernetes startupProbe definitions. This change provides a dedicated grace period for the agent to fully initialize, mitigating issues where slow startup could lead to unnecessary pod restarts, and improving the overall stability of the service.

Highlights

  • Kubernetes Probe Configuration: Introduced startupProbe configurations for the otel-agent across multiple Kubernetes manifest files. This new probe specifically targets port 13133 with an httpGet check, configured to allow up to 5 minutes for the agent to initialize before standard readiness and liveness checks begin.
  • Reliability Improvement: The addition of the startupProbe aims to prevent premature restarts and CrashLoopBackOff errors for the otel-agent, especially in environments with heavy load or resource constraints where the agent might take longer to bind to its health check port.
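
A startupProbe matching that description might look like the sketch below; the period and threshold are assumptions chosen so that periodSeconds × failureThreshold covers the 5-minute window:

```yaml
startupProbe:
  httpGet:
    path: /
    port: 13133
  periodSeconds: 10     # probe every 10 seconds
  failureThreshold: 30  # 30 failures × 10s = 5 minutes before the kubelet gives up
```

Note that this reflects the PR's original approach; later revisions drop the probes entirely instead.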


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a startupProbe to the otel-agent container for improved startup reliability. To improve maintainability, I've suggested centralizing the startupProbe configuration to avoid duplication across multiple manifest files.

@mikebz mikebz requested a review from Copilot July 12, 2025 16:40
Copilot

This comment was marked as outdated.

@@ -195,6 +195,13 @@ data:
volumeMounts:
- name: otel-agent-config-reconciler-vol
mountPath: /conf
startupProbe:
Contributor


Could you clarify the intent of this change? The commit message mentions a liveness probe, but no liveness probe has been defined for these containers since 2023.

Also, I expect the readinessProbe on these containers may be superfluous, as none of these Deployments are exposed through a Service (except for the otel-collector Deployment).

Contributor Author


Updated in http://b/419026380#comment3. The readiness probe for the otel-agent sidecar can also be removed to align with the configuration of the other containers. A readiness failure in a telemetry container is unnecessary and can block the reconciler and resource-group-controller pods.

Contributor


A readiness failure in a telemetry container is unnecessary and can block the reconciler and resource-group-controller pods.

Could you explain why the readiness probe blocks the reconciler and resource-group-controller Pods?

Contributor Author


The PR has been updated to remove the readiness probe from the otel-agent sidecar in the reconciler, reconciler-manager, and resource-group-controller.

Contributor


The PR has been updated to remove the readiness probe from the otel-agent sidecar in the reconciler, reconciler-manager, and resource-group-controller.

The motivation and intent for this change is still unclear. Can you explain the why of this change?

Contributor Author


The test failure was inconsistent and flaky between runs. You can see the test history in the PR email thread.

Previously, even after completely removing the readiness configuration, some tests still showed the resource-group-controller and reconciler-manager pods with readiness settings. This caused failures because the port/health_check extension had already been removed.

When I ran the same tests on my local kind cluster, the issue didn't happen, so I can't identify the root cause yet. I suggest we do this in two steps: first remove the readiness probe, then remove the port and extension infrastructure.

Contributor Author


Wait, I might have reproduced it; allow some more time to figure out the root cause.

Contributor Author


Tests are now passing; the change is ready for review.

Contributor


Sounds good! As discussed previously, let's address the concerns around upgrades before proceeding further.

Contributor Author


I've filed b/443790079 to track this. I recommend we merge PR #1834 first, as it's an independent test fix unrelated to the readinessProbe changes.

@tiffanny29631 tiffanny29631 changed the title Add startupProbe to otel-agent for reliable startup Remove readiness probe and health check from otel-agent sidecars Jul 16, 2025
@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Jul 16, 2025
@mikebz mikebz requested a review from Copilot July 21, 2025 23:24
Copilot

This comment was marked as outdated.

@google-oss-prow google-oss-prow bot added size/S and removed size/M labels Aug 6, 2025
@tiffanny29631 tiffanny29631 changed the title Remove readiness probe and health check from otel-agent sidecars Remove readiness probe otel-agent sidecars Aug 7, 2025
@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Aug 9, 2025
@tiffanny29631 tiffanny29631 force-pushed the otel-agent-resource branch 2 times, most recently from 5bb86cc to 8695de5 Compare August 12, 2025 17:03
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Aug 12, 2025
@tiffanny29631
Contributor Author

/retest-required


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from tiffanny29631. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Sep 3, 2025
@google-oss-prow google-oss-prow bot added size/M and removed size/L labels Sep 3, 2025
@tiffanny29631 tiffanny29631 force-pushed the otel-agent-resource branch 2 times, most recently from 4301409 to 745f07f Compare September 4, 2025 22:01
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Sep 4, 2025
@tiffanny29631 tiffanny29631 force-pushed the otel-agent-resource branch 3 times, most recently from d071b26 to e08dd3f Compare September 5, 2025 00:43
To guarantee the e2e client claims ownership of all fields for objects that might have drifted, use client-side apply when reinstalling Config Sync and the Webhook after ConfigManagement was previously installed and removed.
Copilot

This comment was marked as outdated.

The readiness probe on the otel-agent container was causing operational issues:
- False alarms during slow cluster startup due to health check binding failures
- Inconsistent with other containers (git-sync, reconciler) which don't use readiness probes
- Redundant for a telemetry sidecar that doesn't provide direct user-facing services

@Copilot Copilot AI left a comment


Pull Request Overview

This PR removes readiness probes and health check configurations from otel-agent sidecars across reconciler, reconciler-manager, and resourcegroup controllers. The change prevents unnecessary pod unready states that can occur when the health check component fails under CPU throttling or port binding issues.

  • Removes health_check extension configuration from otel-agent ConfigMaps
  • Removes readiness probes and port 13133 from container specifications
  • Updates test configurations and helper methods to align with the new installation approach
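
Concretely, removing the health_check extension from an OpenTelemetry Collector config means deleting both the extension definition and its entry under service.extensions. A minimal before/after sketch (the endpoint and pipeline names are assumptions, not taken from the actual ConfigMaps):

```yaml
# Before (illustrative)
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [logging]

# After (illustrative)
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [logging]
```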

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
test/kustomization/expected.yaml Updates expected test output removing health check configurations
manifests/templates/resourcegroup-manifest.yaml Removes health check from resourcegroup otel-agent
manifests/templates/reconciler-manager.yaml Removes readiness probe from reconciler-manager otel-agent
manifests/templates/reconciler-manager-configmap.yaml Removes health check configuration from reconciler template
manifests/otel-agent-reconciler-cm.yaml Removes health check from reconciler otel-agent ConfigMap
manifests/otel-agent-cm.yaml Removes health check from base otel-agent ConfigMap
e2e/testcases/cli_test.go Updates test methods to use new installation approach
e2e/nomostest/config_sync.go Adds new installation method using direct manifest application


@@ -1451,11 +1452,11 @@ func TestNomosMigrate(t *testing.T) {
configmanagement.RGControllerName, configmanagement.RGControllerNamespace)
})
tg.Go(func() error {
return nt.Watcher.WatchForNotFound(kinds.Deployment(),
return nt.Watcher.WatchForCurrentStatus(kinds.Deployment(),

Copilot AI Sep 8, 2025


The change from WatchForNotFound to WatchForCurrentStatus appears to fundamentally alter the test behavior from waiting for deletion to waiting for a current status. This seems unrelated to the health check removal and could impact test reliability.

Copilot uses AI. Check for mistakes.

core.RootReconcilerName(configsync.RootSyncName), configsync.ControllerNamespace)
})
tg.Go(func() error {
return nt.Watcher.WatchForNotFound(kinds.RootSyncV1Beta1(),
return nt.Watcher.WatchForCurrentStatus(kinds.RootSyncV1Beta1(),

Copilot AI Sep 8, 2025


The change from WatchForNotFound to WatchForCurrentStatus appears to fundamentally alter the test behavior from waiting for deletion to waiting for a current status. This seems unrelated to the health check removal and could impact test reliability.

Suggested change
return nt.Watcher.WatchForCurrentStatus(kinds.RootSyncV1Beta1(),
return nt.Watcher.WatchForNotFound(kinds.RootSyncV1Beta1(),


@@ -1861,11 +1863,11 @@ func TestACMUninstallScript(t *testing.T) {
configmanagement.RGControllerName, configmanagement.RGControllerNamespace)
})
tg.Go(func() error {
return nt.Watcher.WatchForNotFound(kinds.Deployment(),
return nt.Watcher.WatchForCurrentStatus(kinds.Deployment(),

Copilot AI Sep 8, 2025


The change from WatchForNotFound to WatchForCurrentStatus appears to fundamentally alter the test behavior from waiting for deletion to waiting for a current status. This seems unrelated to the health check removal and could impact test reliability.

Suggested change
return nt.Watcher.WatchForCurrentStatus(kinds.Deployment(),
return nt.Watcher.WatchForNotFound(kinds.Deployment(),


core.RootReconcilerName(configsync.RootSyncName), configsync.ControllerNamespace)
})
tg.Go(func() error {
return nt.Watcher.WatchForNotFound(kinds.RootSyncV1Beta1(),
return nt.Watcher.WatchForCurrentStatus(kinds.RootSyncV1Beta1(),

Copilot AI Sep 8, 2025


The change from WatchForNotFound to WatchForCurrentStatus appears to fundamentally alter the test behavior from waiting for deletion to waiting for a current status. This seems unrelated to the health check removal and could impact test reliability.

Suggested change
return nt.Watcher.WatchForCurrentStatus(kinds.RootSyncV1Beta1(),
return nt.Watcher.WatchForNotFound(kinds.RootSyncV1Beta1(),

