@Arkatufus (Contributor) commented Jul 2, 2025

Fixes #7629

Changes

  • Add specialized shard supervision strategy with feedback mechanism to signal excessive failures
  • Add new SupervisorStrategy settings to ShardSupervisionStrategy (accessible only via C# fluent API)

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

@Arkatufus (Contributor Author) left a comment

Self review


namespace Akka.Cluster.Sharding;

public class ShardSupervisionStrategy: OneForOneStrategy
Contributor Author

This is the custom supervisor strategy, used only by the Shard actor.

Comment on lines 63 to 64
if (restart)
    context.Self.Tell(new ExcessiveSupervisorRestartPassivation(child, WithinTimeRangeMilliseconds, MaxNumberOfRetries, cause));
Contributor Author

The ProcessFailure code is essentially the same as in OneForOneStrategy, plus these additional lines: we send the Shard actor a warning message that the failing child is due for termination because it is thrashing.
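To make the flow described above concrete, here is a minimal, dependency-free sketch. It is not the PR's code: `RestartBudget` is a hypothetical stand-in for Akka.NET's per-child restart statistics, the windowing logic is an assumed simplification, and `tellSelf` is a hypothetical callback standing in for `context.Self.Tell(...)`.

```csharp
using System;

// Hypothetical stand-in for the per-child restart statistics; NOT the Akka.NET type.
public sealed class RestartBudget
{
    private int _restartsInWindow;
    private DateTime _windowStart;

    // Assumed semantics: allow at most maxRetries restarts inside one time window.
    public bool RequestRestartPermission(int maxRetries, TimeSpan window, DateTime now)
    {
        // Open a fresh window on the first failure or once the old window has elapsed.
        if (_restartsInWindow == 0 || now - _windowStart > window)
        {
            _windowStart = now;
            _restartsInWindow = 0;
        }
        _restartsInWindow++;
        return _restartsInWindow <= maxRetries;
    }
}

public static class ShardSupervisionSketch
{
    // Restart while budget remains; otherwise stop the child and, if a restart
    // was requested but denied, notify the parent that the child is thrashing.
    public static string ProcessFailure(
        bool restart, RestartBudget stats, int maxRetries,
        TimeSpan window, DateTime now, Action<string> tellSelf)
    {
        if (restart && stats.RequestRestartPermission(maxRetries, window, now))
            return "restart";

        if (restart)
            tellSelf("ExcessiveSupervisorRestartPassivation"); // stands in for the real message

        return "stop";
    }
}
```

In this sketch, with `maxRetries = 2`, the first two failures inside the window are restarted and the third is stopped with exactly one notification sent to the parent.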

Member

This is not quite right - we also need to handle scenarios where the SupervisorStrategy decided to issue a Stop directive. Reason being: if the actor failed in such a way that it has to be stopped, it's by definition an irrecoverable exception. Continuing to remember the entity after we get this type of signal back is net-destructive.
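The requested behaviour can be sketched as a small decision function. This is an illustrative model only: the `Directive` and `Signal` enums and the `Decide` helper are hypothetical, and only the two message names are taken from this PR.

```csharp
using System;

// Hypothetical, simplified directive set; the real supervision model has more cases.
public enum Directive { Restart, Stop }

// Which feedback message, if any, the strategy would send to the Shard.
public enum Signal
{
    None,
    ExcessiveSupervisorRestartPassivation,
    SupervisorStopDirectivePassivation
}

public static class PassivationSignalSketch
{
    // A Stop directive always produces a signal, because the failure is treated
    // as irrecoverable. A Restart directive produces one only when the restart
    // budget is exhausted.
    public static Signal Decide(Directive directive, bool restartBudgetLeft)
    {
        if (directive == Directive.Stop)
            return Signal.SupervisorStopDirectivePassivation;

        return restartBudgetLeft
            ? Signal.None
            : Signal.ExcessiveSupervisorRestartPassivation;
    }
}
```

Either signal would tell the Shard that the entity should no longer be remembered; how the Shard acts on it is outside this sketch.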

Member

Also: we need to make sure this only applies to entity actors, not to any other children of the Shard, such as the RE infrastructure itself. You can determine this by checking the actor paths.

Member

> Also: we need to make sure this only applies to entity actors, not to any other children of the Shard, such as the RE infrastructure itself. You can determine this by checking the actor paths.

Never mind, this looks like it gets handled inside the Shard message handlers.

Comment on lines 63 to 64

internal sealed record ExcessiveSupervisorRestartPassivation(IActorRef Child, int TimeWindowInMilliseconds, int MaxRestartCount, Exception LastCause) : IShardRegionCommand;
Contributor Author

New internal, local-only message sent from the supervisor strategy to the Shard actor.

@Arkatufus (Contributor Author)

A note here: while this design works in a very quiet system, it might still fail on a very busy one.

The scheme would be less responsive than the unit test suggests if the ExcessiveSupervisorRestartPassivation message got buried in the Shard mailbox.

@Aaronontheweb (Member) left a comment

Needs some changes in the way the supervision strategy is implemented

@Arkatufus (Contributor Author)

OK, all fixed

@Aaronontheweb (Member) left a comment

LGTM

public object StopMessage { get; }
}

internal sealed record SupervisorStopDirectivePassivation(IActorRef Child, string Reason, Exception LastCause) : IShardRegionCommand;
Member

LGTM

{
    if (restart && stats.RequestRestartPermission(MaxNumberOfRetries, WithinTimeRangeMilliseconds))
        RestartChild(child, cause, suspendFirst: false);
    else
Member

LGTM

@Aaronontheweb Aaronontheweb enabled auto-merge (squash) July 7, 2025 16:48
Development

Successfully merging this pull request may close these issues.

Akka.Cluster.Sharding: dealing with remember-entities and actors who can't start up correctly
2 participants