Stop failing remembered entity if supervisor strategy failed #7720

Arkatufus · 2025-07-02T14:21:31Z

Fixes #7629

Changes

Add specialized shard supervision strategy with feedback mechanism to signal excessive failures
Add new SupervisorStrategy settings to ShardSupervisionStrategy (accessible only via C# fluent API)

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

This change follows the Akka.NET API Compatibility Guidelines.
I have reviewed my own pull request.
Design discussion issue Akka.Cluster.Sharding: dealing with remember-entities and actors who can't start up correctly #7629
Changes in public API reviewed, if any.

Arkatufus

Self review

Arkatufus · 2025-07-02T14:48:09Z

src/contrib/cluster/Akka.Cluster.Sharding/ShardSupervisionStrategy.cs

+
+namespace Akka.Cluster.Sharding;
+
+public class ShardSupervisionStrategy: OneForOneStrategy


This is the custom supervisor strategy, only for the Shard actor

Arkatufus · 2025-07-02T14:51:05Z

src/contrib/cluster/Akka.Cluster.Sharding/ShardSupervisionStrategy.cs

+            if(restart)
+                context.Self.Tell(new ExcessiveSupervisorRestartPassivation(child, WithinTimeRangeMilliseconds, MaxNumberOfRetries, cause));


The ProcessFailure code is nearly identical to the one in OneForOneStrategy, except for these additional lines: we send the Shard actor a warning message that the failing child is due for termination because it is thrashing.

Aaronontheweb · 2025-07-03T16:53:37Z

This is not quite right - we also need to handle scenarios where the SupervisorStrategy decided to issue a Stop directive. Reason being: if the actor failed in such a way that it has to be stopped, it's by definition an irrecoverable exception. Continuing to remember the entity after we get this type of signal back is net-destructive.

Also: we need to make sure this only applies to entity actors, not to any other children of the Shard, such as the RE infrastructure itself. You can determine this by checking the actor paths.

Never mind, this gets handled inside the Shard message handlers by the looks of things.
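The routing requested in this thread (notify the Shard when restarts are exhausted, and also when the strategy issues a Stop directive) can be sketched without any Akka.NET dependency. Everything below is a hypothetical illustration: `Directive`, `ShardNotice`, and `FailureRouting.Decide` are stand-in names, not the PR's actual types, and treating Resume and Escalate as "no notice" is an assumption.

```csharp
using System;

// Hypothetical stand-in for the supervisor directive; not Akka.NET's own type.
public enum Directive { Resume, Restart, Stop, Escalate }

// Hypothetical stand-in for the two passivation notices discussed in this PR.
public enum ShardNotice { None, ExcessiveRestarts, StopDirective }

public static class FailureRouting
{
    // Decides which notice (if any) the Shard should receive for a failed child.
    // restartPermitted models the retry-window check: true while the child is
    // still within its restart budget.
    public static ShardNotice Decide(Directive directive, bool restartPermitted)
    {
        switch (directive)
        {
            case Directive.Restart:
                // A permitted restart is silent; an exhausted budget is reported.
                return restartPermitted ? ShardNotice.None : ShardNotice.ExcessiveRestarts;
            case Directive.Stop:
                // A Stop directive is treated as irrecoverable and always reported.
                return ShardNotice.StopDirective;
            default:
                // Assumption: Resume and Escalate do not notify the Shard.
                return ShardNotice.None;
        }
    }
}
```

In this sketch the Shard would stop remembering the entity on either non-`None` notice, which is the behavior the review asks for.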

Arkatufus · 2025-07-02T14:51:45Z

src/contrib/cluster/Akka.Cluster.Sharding/ShardingMessages.cs


+    internal sealed record ExcessiveSupervisorRestartPassivation(IActorRef Child, int TimeWindowInMilliseconds, int MaxRestartCount, Exception LastCause) : IShardRegionCommand;


New internal, local-only message from the supervisor strategy to the Shard actor

Arkatufus · 2025-07-02T14:55:56Z

A note here: while this design works in a very quiet system, it might still fail on a very busy one.

This scheme would not be as responsive as the unit test suggests if the ExcessiveSupervisorRestartPassivation message got buried in the Shard mailbox (busy system).

Aaronontheweb

Needs some changes in the way the supervision strategy is implemented

Aaronontheweb · 2025-07-03T16:53:37Z

src/contrib/cluster/Akka.Cluster.Sharding/ShardSupervisionStrategy.cs

+            if(restart)
+                context.Self.Tell(new ExcessiveSupervisorRestartPassivation(child, WithinTimeRangeMilliseconds, MaxNumberOfRetries, cause));


This is not quite right - we also need to handle scenarios where the SupervisorStrategy decided to issue a Stop directive. Reason being: if the actor failed in such a way that it has to be stopped, it's by definition an irrecoverable exception. Continuing to remember the entity after we get this type of signal back is net-destructive.

Arkatufus · 2025-07-07T16:36:46Z

OK, all fixed

Aaronontheweb

LGTM

Aaronontheweb · 2025-07-07T16:48:20Z

src/contrib/cluster/Akka.Cluster.Sharding/ShardingMessages.cs

        public object StopMessage { get; }
    }

+    internal sealed record SupervisorStopDirectivePassivation(IActorRef Child, string Reason, Exception LastCause) : IShardRegionCommand;


Aaronontheweb · 2025-07-07T16:48:41Z

src/contrib/cluster/Akka.Cluster.Sharding/ShardSupervisionStrategy.cs

+    {
+        if (restart && stats.RequestRestartPermission(MaxNumberOfRetries, WithinTimeRangeMilliseconds))
+            RestartChild(child, cause, suspendFirst: false);
+        else
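The condition in the snippet above depends on a restart budget: a child may restart only a limited number of times inside a time window. A minimal sketch of that idea follows. `RestartBudget` is a hypothetical, simplified class and is not Akka.NET's own restart-statistics implementation; the explicit `nowMs` clock parameter and the rule that a non-positive `maxRetries` forbids restarts are assumptions made to keep the example deterministic.

```csharp
using System;

// Hypothetical, simplified restart budget; not Akka.NET's implementation.
public sealed class RestartBudget
{
    private int _restartsInWindow;
    private long _windowStartMs = -1; // -1 means no window has been opened yet

    // Returns true if another restart is allowed at time nowMs.
    public bool RequestRestartPermission(int maxRetries, int windowMs, long nowMs)
    {
        // Assumption for this sketch: a non-positive limit forbids restarts.
        if (maxRetries <= 0) return false;

        // Open a fresh window on first use or once the previous one has expired.
        if (_windowStartMs < 0 || nowMs - _windowStartMs > windowMs)
        {
            _windowStartMs = nowMs;
            _restartsInWindow = 0;
        }

        _restartsInWindow++;
        return _restartsInWindow <= maxRetries;
    }
}
```

When the budget returns false, the strategy in this PR takes the `else` branch, which is where the child is stopped and the Shard is notified.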


Arkatufus added 2 commits July 2, 2025 21:13

Stop failing remembered entity if supervisor strategy failed

f9414b6

Update API Approval list

91d87b4

Arkatufus commented Jul 2, 2025

Aaronontheweb requested changes Jul 3, 2025

Arkatufus added 2 commits July 7, 2025 23:35

Fix logic

a14fd32

Merge branch 'dev' into akkadotnet#7629-fix-dying-RE-actor

6ba4fe0

Aaronontheweb added the akka-cluster-sharding label Jul 7, 2025

Aaronontheweb approved these changes Jul 7, 2025

Aaronontheweb enabled auto-merge (squash) July 7, 2025 16:48

Arkatufus added 2 commits July 7, 2025 23:54

Fix ShardEntityFailureSpec

651f439

Merge branch 'akkadotnet#7629-fix-dying-RE-actor' of github.com:Arkat…

930e00a

…ufus/akka.net into akkadotnet#7629-fix-dying-RE-actor

Aaronontheweb merged commit ec8a419 into akkadotnet:dev Jul 7, 2025
11 checks passed

Arkatufus mentioned this pull request Jul 7, 2025

Update RELEASE_NOTES for 1.5.45 release #7723

Merged
