-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Stop failing remembered entity if supervisor strategy failed #7720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop failing remembered entity if supervisor strategy failed #7720
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self review
|
||
namespace Akka.Cluster.Sharding; | ||
|
||
public class ShardSupervisionStrategy: OneForOneStrategy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the custom supervisor strategy, only for the Shard
actor
if(restart) | ||
context.Self.Tell(new ExcessiveSupervisorRestartPassivation(child, WithinTimeRangeMilliseconds, MaxNumberOfRetries, cause)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ProcessFailure
code is pretty much the same as the OneToOneStrategy
with these additional lines, we send the Shard
actor a warning message that this failing child is due for termination because it is thrashing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not quite right - we also need to handle scenarios where the SupervisorStrategy
decided to issue a Stop
directive. Reason being: if the actor failed in such a way that it has to be stopped, it's by definition an irrecoverable exception. Continuing to remember the entity after we get this type of signal back is net-destructive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: we need to make sure this only applies to entity actors, not to any other children of the Shard
, such as the RE infrastructure itself. You can determine this by checking the actor paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: we need to make sure this only applies to entity actors, not to any other children of the Shard, such as the RE infrastructure itself. You can determine this by checking the actor paths.
nevermind, this gets handled inside the Shard
message handlers by the looks of things
|
||
internal sealed record ExcessiveSupervisorRestartPassivation(IActorRef Child, int TimeWindowInMilliseconds, int MaxRestartCount, Exception LastCause) : IShardRegionCommand; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New internal local only message from supervisor strategy to Shard
actor
A note here, while this design works in a very quiet system, it might still somehow fail on a very busy system. This scheme would not be as responsive as what the unit test shows if the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs some changes in the way the supervision strategy is implemented
if(restart) | ||
context.Self.Tell(new ExcessiveSupervisorRestartPassivation(child, WithinTimeRangeMilliseconds, MaxNumberOfRetries, cause)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not quite right - we also need to handle scenarios where the SupervisorStrategy
decided to issue a Stop
directive. Reason being: if the actor failed in such a way that it has to be stopped, it's by definition an irrecoverable exception. Continuing to remember the entity after we get this type of signal back is net-destructive.
OK, all fixed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
public object StopMessage { get; } | ||
} | ||
|
||
internal sealed record SupervisorStopDirectivePassivation(IActorRef Child, string Reason, Exception LastCause) : IShardRegionCommand; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
{ | ||
if (restart && stats.RequestRestartPermission(MaxNumberOfRetries, WithinTimeRangeMilliseconds)) | ||
RestartChild(child, cause, suspendFirst: false); | ||
else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…ufus/akka.net into akkadotnet#7629-fix-dying-RE-actor
Fixes #7629
Changes
SupervisorStrategy
settings toShardSupervisionStrategy
(accessible only via C# fluent API)Checklist
For significant changes, please ensure that the following have been completed (delete if not relevant):
remember-entities
and actors who can't start up correctly #7629