PBM-1593 Use RetryReader when reading from Azure #1177
+8
−1
A customer has experienced restores from Azure Blob store to Azure VMs becoming unreliable, where file transfers appear to be terminated after some period of idleness. It is not clear what causes the idleness, or what caused the change in behaviour, as it is affecting all of the customer's environments at all scales.
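The idea named in the title, resuming a read after the connection drops instead of failing the whole transfer, can be sketched with the standard library alone. This is a minimal illustration and not the Azure SDK's RetryReader or the code in this PR: `retryReader`, `openFunc`, `flakyBody` and `readWithInterruptions` are hypothetical names, and the "blob" is just an in-memory string.

```go
package main

import (
	"errors"
	"io"
	"strings"
)

// openFunc is a hypothetical callback that (re)opens the blob at a byte offset.
type openFunc func(offset int64) (io.ReadCloser, error)

// retryReader resumes from the last good offset when the stream fails.
type retryReader struct {
	open       openFunc
	body       io.ReadCloser
	offset     int64
	maxRetries int // retries allowed per Read call
}

func (r *retryReader) Read(p []byte) (int, error) {
	for attempt := 0; ; attempt++ {
		if r.body == nil {
			body, err := r.open(r.offset)
			if err != nil {
				return 0, err
			}
			r.body = body
		}
		n, err := r.body.Read(p)
		r.offset += int64(n)
		if err == nil || err == io.EOF {
			return n, err
		}
		// The stream is broken: drop it so the next attempt reopens at r.offset.
		r.body.Close()
		r.body = nil
		if n > 0 {
			return n, nil // hand over the bytes we did get
		}
		if attempt >= r.maxRetries {
			return 0, err
		}
	}
}

// flakyBody simulates a connection that dies after limit bytes.
type flakyBody struct {
	src   io.Reader
	limit int
}

func (f *flakyBody) Read(p []byte) (int, error) {
	if f.limit <= 0 {
		return 0, errors.New("connection reset")
	}
	if len(p) > f.limit {
		p = p[:f.limit]
	}
	n, err := f.src.Read(p)
	f.limit -= n
	return n, err
}

func (f *flakyBody) Close() error { return nil }

// readWithInterruptions reads data through connections that each fail
// after failAfter bytes, relying on retryReader to resume.
func readWithInterruptions(data string, failAfter int) (string, error) {
	open := func(offset int64) (io.ReadCloser, error) {
		return &flakyBody{src: strings.NewReader(data[offset:]), limit: failAfter}, nil
	}
	out, err := io.ReadAll(&retryReader{open: open, maxRetries: 3})
	return string(out), err
}
```

Calling `readWithInterruptions("hello, azure blob", 5)` returns the full string even though every simulated connection breaks after five bytes. With `failAfter` set to 0 no connection ever delivers data, so the retries are exhausted and the error is returned to the caller.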
To test this change, it is necessary to interrupt TCP connections so that they are terminated and the recovery is triggered. To perform this test, I triggered a restore and then used iftop to identify high-traffic active TCP connections to Azure. Using the ephemeral port number of an active connection, I then added firewall rules to drop/reject the connection.
On EL7 iptables systems I used:
On EL9 nft systems I used:
This would cause the TCP connection to hang and be reaped by the kernel's TCP keepalive config. These were monitored with:
The default kernel TCP keepalive config is:
However, pbm sets the initial time and interval to 30 seconds. The number of probes sent before the connection is terminated uses the kernel-configured value, so hung connections take about 9*30s = 4.5 mins to be reaped.
Where a single TCP port is blocked, hung connections are retried:
Where port 443 is blocked (i.e. all retries fail), the restore still eventually fails:
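The keepalive timing quoted above can be checked with a few lines of arithmetic. This is only a sketch: the constants mirror the values stated in this description (30 second idle time and interval, 9 probes), and `probeWindow` and `reapTime` are hypothetical helper names, not pbm functions. It assumes the usual Linux behaviour where probing starts after the idle time has elapsed, so the idle period adds to the 4.5 minutes spent probing.

```go
package main

import "time"

const (
	assumedIdle     = 30 * time.Second // idle time before the first probe, as stated above
	assumedInterval = 30 * time.Second // gap between probes, as stated above
	assumedProbes   = 9                // probe count implied by the 9*30s figure above
)

// probeWindow is the time spent sending unanswered probes.
func probeWindow(interval time.Duration, probes int) time.Duration {
	return time.Duration(probes) * interval
}

// reapTime adds the initial idle period to the probe window.
func reapTime(idle, interval time.Duration, probes int) time.Duration {
	return idle + probeWindow(interval, probes)
}
```

With these values `probeWindow(assumedInterval, assumedProbes)` is 4m30s, matching the figure above, and `reapTime(assumedIdle, assumedInterval, assumedProbes)` is 5m0s once the initial idle period is counted.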