
Conversation


@dcantah commented Sep 2, 2025

Today it's possible for the daemon to receive the container exit event after a user has already invoked `container delete`. The aftermath of that unfortunate scenario is that you'd be met with an error stating we can't delete the container because it is running. This isn't all that common unless someone is doing kill/stop+delete in a script, but it's still an issue that needs to be fixed. The true source of truth for state is the sandbox service hosting the container, so if we were guaranteed that it is still alive, we could query it and take action based on any mismatch between the current snapshot state and what the sandbox service is telling us.

To make this possible I've introduced a new RPC on the sandbox service named `shutdown`. Shutdown's purpose is to run some cleanup code (if there is any) and then wind down the process. Prior to this RPC the sandbox service would basically just call exit(0) of its own accord after the container exited. Now this process is driven by the APIServer instead, which I like a little bit more. The real win here, though, is that as long as we haven't sent the shutdown RPC we know the process is alive (barring a bug), so we can query the state of the container the whole time.
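A minimal sketch of the idea, using hypothetical names (`SandboxClient`, `ContainerState`, `state(id:)`, and `shutdown(id:)` are illustrative stand-ins, not the project's actual API): the API server asks the still-running sandbox service for the container's real state before deciding whether the delete can proceed, and only afterwards tells it to shut down.

```swift
// Hypothetical sketch only: none of these names are the project's real API.
enum ContainerState {
    case running
    case stopped
}

protocol SandboxClient {
    /// Query the live container state from the sandbox service, which stays
    /// alive until it is explicitly told to shut down.
    func state(id: String) async throws -> ContainerState
    /// The new RPC: run any remaining cleanup, then wind the sandbox
    /// service process down (replacing the old self-initiated exit(0)).
    func shutdown(id: String) async throws
}

enum DeleteError: Error {
    case stillRunning(String)
}

func delete(id: String, sandbox: SandboxClient) async throws {
    // Even if the exit event hasn't reached the daemon yet, the sandbox
    // service can report the container's real state.
    guard case .stopped = try await sandbox.state(id: id) else {
        throw DeleteError.stillRunning(id)
    }
    // ... remove the container's on-disk state here ...

    // Teardown of the sandbox service is now driven from the API server
    // side rather than the service exiting on its own.
    try await sandbox.shutdown(id: id)
}
```

The point is only that the delete path can trust a live query instead of the daemon's last snapshot of the container's state.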

@dcantah marked this pull request as ready for review September 2, 2025 17:29
networks: []
)
await self.setContainer(id, snapshot, context: context)
if !autoRemove {
Contributor

perhaps a one-line explanation of why we don't invoke this for the auto-remove case? who's responsible for cleanup?

return
}

if !force {
Contributor

nit: perhaps L243-L258 in an else, and then 237-240 can go away in favor of 259-262?


Task {
do {
try await Task.sleep(for: .seconds(3))
Contributor

If we're in the .shuttingDown state, the state going back to the containers helper will be .stopped, right? Is there any issue with the state being that way for 3 secs while the launch agent is still running? I'm thinking of a new container with the same name being created/run immediately.

@dcantah marked this pull request as draft September 9, 2025 05:57