PBM-1593 Use RetryReader when reading from Azure #1177

DanielOliverRJ · 2025-08-13T12:17:38Z

A customer has found that restores from Azure Blob store to Azure VMs have become unreliable: file transfers appear to be terminated after a period of idleness. It is not clear what causes the idleness or what changed the behaviour, as it is affecting all of the customer's environments at all scales.

To test this change, it is necessary to interrupt TCP connections so that they are terminated and the recovery is triggered. To perform this test, I triggered a restore and then used iftop to identify high-traffic active TCP connections to Azure. Using the ephemeral port number of an active connection, I then added firewall rules to drop/reject the connection. On EL7 systems using iptables, I used:

P=53320; iptables -I INPUT 1 -p tcp --dport $P -j REJECT --reject-with tcp-reset; iptables -I OUTPUT 1 -p tcp --sport $P -j REJECT --reject-with tcp-reset

On EL9 nft systems I used:

sed -e 's/PORTTODROP/33020/g' rules-template > rules; nft flush ruleset inet; nft -f ~/rules

Where rules-template contained blocks like:
chain filter_INPUT {
  type filter hook input priority filter + 10; policy accept;
  tcp sport PORTTODROP drop
  tcp dport PORTTODROP drop

This causes the TCP connection to hang until it is reaped according to the kernel's TCP keepalive settings. The connections were monitored with:

ss -at --extended | grep 20.209.31.129

ESTAB     0      0        10.10.71.31:56646    20.209.31.129:https timer:(keepalive,18sec,0) uid:1736 ino:2198999 sk:604c cgroup:/user.slice/user-5108.slice/session-13.scope <->

The default kernel TCP keepalive config is

sysctl -a | grep keepalive

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

However, pbm sets the initial time and interval to 30 seconds. The number of probes sent before the connection is terminated comes from the kernel-configured value, so hung connections take about 9 * 30 s = 4.5 minutes to be reaped.
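
The arithmetic above can be sketched in Go. This is only an illustration: it assumes the 30-second value is applied through a `net.Dialer`-style `KeepAlive` setting (how pbm actually configures it is not shown here), and `reapEstimate` is a hypothetical helper. It counts one interval per unanswered probe; the idle time before the first probe comes on top of that.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// reapEstimate approximates how long a hung connection survives once
// keepalive probing has started: one interval per unanswered probe.
// Hypothetical helper, not part of pbm.
func reapEstimate(interval time.Duration, probes int) time.Duration {
	return time.Duration(probes) * interval
}

func main() {
	// Assumption: the keepalive period is set on a dialer like this one.
	d := net.Dialer{KeepAlive: 30 * time.Second}

	// The probe count is left at the kernel default,
	// net.ipv4.tcp_keepalive_probes = 9.
	probes := 9

	fmt.Println(reapEstimate(d.KeepAlive, probes)) // 4m30s
}
```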

Where a single TCP port is blocked, hung connections are retried:

2025-08-13T09:27:29.768+0000	X  1.62GB
2025-08-13T09:27:32.768+0000	X  1.62GB
2025-08-13T09:27:35.768+0000	X  1.62GB
2025-08-13T09:27:38.768+0000	X  1.62GB
2025-08-13T09:27:40.000+0000 D [restore/2025-08-13T09:26:17.523785788Z] Read from Azure failed (attempt 1): read tcp 10.10.71.31:33020->20.209.31.129:443: read: connection timed out, retrying: true
2025-08-13T09:27:41.767+0000	X  1.75GB
2025-08-13T09:27:44.767+0000	X 2.03GB
2025-08-13T09:27:47.767+0000	X  2.34GB

Where port 443 is blocked (i.e. all retries fail), the restore still eventually fails:

2025-08-13T10:44:59.000+0000 D [restore/2025-08-13T10:43:29.948758435Z] Read from Azure failed (attempt 1): read tcp 10.10.71.31:41368->20.209.31.129:443: read: connection timed out, retrying: true
...
2025-08-13T10:56:15.466+0000	X  2.44GB
2025-08-13T10:56:15.466+0000	finished restoring X (222999 documents, 0 failures)
2025-08-13T10:56:15.466+0000	demux finishing when there are still outs (1)
2025-08-13T10:56:15.466+0000	demux finishing (err:corruption found in archive; I/O error reading length or terminator ( compose: write namespaces: split: read bson: Get "https://X.blob.core.windows.net/X": dial tcp 20.209.31.129:443: i/o timeout ))
2025-08-13T10:56:15.000+0000 E [restore/2025-08-13T10:43:29.948758435Z] restore: mongorestore: restore mongo dump (successes: 222999 / fails: 0): X: error restoring from archive on stdin: reading bson input: error demultiplexing archive; archive io error

Copilot

Pull Request Overview

This PR implements retry logic for Azure blob storage reads to address customer issues with unreliable restores from Azure Blob store to Azure VMs, where file transfers were being terminated after periods of idleness.

Replaces direct use of response body with Azure's RetryReader wrapper
Adds debug logging for failed read attempts with retry information
Configures early close detection as an error to trigger retries
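
The general technique behind a retrying reader can be sketched with the standard library alone. This is not the Azure SDK's RetryReader and not the code in this PR: `retryReader`, `openFunc`, `flakyBody` and `readWithOneFailure` are hypothetical names, and the sketch only shows the idea of reopening the stream at the current offset after a failed read and reporting each failure through a callback.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// openFunc opens the stream at a byte offset (stands in for a ranged download).
type openFunc func(offset int64) (io.ReadCloser, error)

// retryReader resumes a failed read by reopening the stream at the current offset.
type retryReader struct {
	open       openFunc
	body       io.ReadCloser
	offset     int64
	maxRetries int
	onFailed   func(attempt int, err error, willRetry bool)
}

func (r *retryReader) Read(p []byte) (int, error) {
	for attempt := 1; ; attempt++ {
		if r.body == nil {
			body, err := r.open(r.offset)
			if err != nil {
				return 0, err // reopening failed: give up
			}
			r.body = body
		}
		n, err := r.body.Read(p)
		r.offset += int64(n)
		if err == nil || errors.Is(err, io.EOF) {
			return n, err
		}
		// The read failed: drop the broken stream so it is reopened later.
		r.body.Close()
		r.body = nil
		willRetry := attempt <= r.maxRetries
		if r.onFailed != nil {
			r.onFailed(attempt, err, willRetry)
		}
		if !willRetry {
			return n, err
		}
		if n > 0 {
			return n, nil // hand over what was read; the next Read resumes
		}
	}
}

var errReset = errors.New("connection reset")

// flakyBody yields at most `left` bytes and then fails with errReset.
type flakyBody struct {
	r    io.Reader
	left int
}

func (f *flakyBody) Read(p []byte) (int, error) {
	if f.left == 0 {
		return 0, errReset
	}
	if len(p) > f.left {
		p = p[:f.left]
	}
	n, err := f.r.Read(p)
	f.left -= n
	return n, err
}

func (f *flakyBody) Close() error { return nil }

// readWithOneFailure reads data through a retryReader whose first stream
// breaks after breakAt bytes. It returns the data read and the failure count.
func readWithOneFailure(data string, breakAt int) (string, int, error) {
	opens, failures := 0, 0
	rr := &retryReader{
		maxRetries: 3,
		open: func(offset int64) (io.ReadCloser, error) {
			opens++
			src := strings.NewReader(data[offset:])
			if opens == 1 {
				return &flakyBody{r: src, left: breakAt}, nil
			}
			return io.NopCloser(src), nil
		},
		onFailed: func(attempt int, err error, willRetry bool) {
			failures++
			fmt.Printf("read failed (attempt %d): %v, retrying: %t\n", attempt, err, willRetry)
		},
	}
	out, err := io.ReadAll(rr)
	return string(out), failures, err
}

func main() {
	out, failures, err := readWithOneFailure("hello, azure blob", 5)
	fmt.Println(out, failures, err)
}
```

In this sketch a failure to reopen the stream is returned immediately rather than retried, so when the endpoint is unreachable the error still reaches the caller.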

pbm/storage/azure/azure.go

Co-authored-by: Copilot <[email protected]>

DanielOliverRJ changed the title ~~Use RetryReader when reading from Azure~~ PBM-1593 Use RetryReader when reading from Azure Aug 13, 2025

Use RetryReader when reading from Azure

3639bee

DanielOliverRJ force-pushed the LS-20087 branch from bcf1af7 to 3639bee Compare August 13, 2025 12:33

DanielOliverRJ marked this pull request as ready for review August 13, 2025 15:22

DanielOliverRJ requested review from boris-ilijic and inelpandzic as code owners August 13, 2025 15:22

radoslawszulgo requested a review from Copilot August 22, 2025 13:25

Copilot AI reviewed Aug 22, 2025

pbm/storage/azure/azure.go (outdated)

Accept suggestion from Copilot

d2a5ed6

Co-authored-by: Copilot <[email protected]>
