

@DanielOliverRJ DanielOliverRJ commented Aug 13, 2025

LS-20087

A customer has experienced restores from Azure Blob store to Azure VMs becoming unreliable, where file transfers would appear to be terminated after some period of idleness. It is not clear what causes the idleness, or what changed the behaviour, as it is affecting all of the customer's environments at all scales.

To test this change, it is necessary to interrupt TCP connections so that they are terminated and the recovery is triggered. To perform this test, I triggered a restore and then used iftop to identify high-traffic active TCP connections to Azure. Using the ephemeral port number of an active connection, I then added firewall rules to drop/reject the connection. On EL7 systems using iptables I used:

P=53320; iptables -I INPUT 1 -p tcp --dport $P -j REJECT --reject-with tcp-reset; iptables -I OUTPUT 1 -p tcp --sport $P -j REJECT --reject-with tcp-reset

On EL9 nft systems I used:

sed -e 's/PORTTODROP/33020/g' rules-template > rules; nft flush ruleset inet; nft -f ~/rules

Where rules-template contained blocks like
chain filter_INPUT {
  type filter hook input priority filter + 10; policy accept;
  tcp sport PORTTODROP drop
  tcp dport PORTTODROP drop

This would cause the TCP connection to hang until it was reaped according to the kernel's TCP keepalive config. The hung connections were monitored with

ss -at --extended | grep 20.209.31.129

ESTAB     0      0        10.10.71.31:56646    20.209.31.129:https timer:(keepalive,18sec,0) uid:1736 ino:2198999 sk:604c cgroup:/user.slice/user-5108.slice/session-13.scope <->

The default kernel TCP keepalive config is

sysctl -a | grep keepalive

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

However, pbm sets the initial time and interval to 30 seconds. The number of probes sent before the connection is terminated uses the kernel-configured value, so hung connections take about 9*30s = 4.5 mins to be reaped.

Where a single TCP port is blocked, hung connections are retried:

2025-08-13T09:27:29.768+0000	X  1.62GB
2025-08-13T09:27:32.768+0000	X  1.62GB
2025-08-13T09:27:35.768+0000	X  1.62GB
2025-08-13T09:27:38.768+0000	X  1.62GB
2025-08-13T09:27:40.000+0000 D [restore/2025-08-13T09:26:17.523785788Z] Read from Azure failed (attempt 1): read tcp 10.10.71.31:33020->20.209.31.129:443: read: connection timed out, retrying: true
2025-08-13T09:27:41.767+0000	X  1.75GB
2025-08-13T09:27:44.767+0000	X 2.03GB
2025-08-13T09:27:47.767+0000	X  2.34GB

Where port 443 is blocked (i.e. all retries fail), the restore still eventually fails:

2025-08-13T10:44:59.000+0000 D [restore/2025-08-13T10:43:29.948758435Z] Read from Azure failed (attempt 1): read tcp 10.10.71.31:41368->20.209.31.129:443: read: connection timed out, retrying: true
...
2025-08-13T10:56:15.466+0000	X  2.44GB
2025-08-13T10:56:15.466+0000	finished restoring X (222999 documents, 0 failures)
2025-08-13T10:56:15.466+0000	demux finishing when there are still outs (1)
2025-08-13T10:56:15.466+0000	demux finishing (err:corruption found in archive; I/O error reading length or terminator ( compose: write namespaces: split: read bson: Get "https://X.blob.core.windows.net/X": dial tcp 20.209.31.129:443: i/o timeout ))
2025-08-13T10:56:15.000+0000 E [restore/2025-08-13T10:43:29.948758435Z] restore: mongorestore: restore mongo dump (successes: 222999 / fails: 0): X: error restoring from archive on stdin: reading bson input: error demultiplexing archive; archive io error

@DanielOliverRJ DanielOliverRJ changed the title Use RetryReader when reading from Azure PBM-1593 Use RetryReader when reading from Azure Aug 13, 2025

@Copilot Copilot AI left a comment


Pull Request Overview

This PR implements retry logic for Azure blob storage reads to address customer issues with unreliable restores from Azure Blob store to Azure VMs, where file transfers were being terminated after periods of idleness.

  • Replaces direct use of response body with Azure's RetryReader wrapper
  • Adds debug logging for failed read attempts with retry information
  • Configures early close detection as an error to trigger retries
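The general idea behind a retrying reader can be sketched with the standard library alone. This is not the Azure SDK's RetryReader nor the code in this PR: `retryReader`, `openFunc`, `flaky`, and `fetch` are hypothetical names, and reopening at an offset stands in for re-issuing a ranged download. The sketch shows a failed read being logged and then resumed from the last byte delivered.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// openFunc reopens the source at a byte offset (hypothetical; stands in
// for a ranged GET against blob storage).
type openFunc func(offset int64) (io.Reader, error)

// retryReader resumes a failed read from the last delivered offset.
// The retry budget applies to each Read call separately.
type retryReader struct {
	open       openFunc
	cur        io.Reader
	offset     int64
	maxRetries int
}

func (r *retryReader) Read(p []byte) (int, error) {
	for attempt := 0; ; attempt++ {
		if r.cur == nil {
			cur, err := r.open(r.offset)
			if err != nil {
				if attempt >= r.maxRetries {
					return 0, err
				}
				continue
			}
			r.cur = cur
		}
		n, err := r.cur.Read(p)
		r.offset += int64(n)
		if err == nil || err == io.EOF {
			return n, err
		}
		// The stream broke: drop it so the next attempt reopens at r.offset.
		r.cur = nil
		willRetry := attempt < r.maxRetries
		fmt.Printf("read failed (attempt %d): %v, retrying: %t\n", attempt+1, err, willRetry)
		if n > 0 {
			return n, nil // hand over partial data; the next Read reopens
		}
		if !willRetry {
			return 0, err
		}
	}
}

// flaky delivers limit bytes and then fails, simulating a dropped connection.
type flaky struct {
	r     io.Reader
	limit int
}

func (f *flaky) Read(p []byte) (int, error) {
	if f.limit <= 0 {
		return 0, errors.New("connection timed out")
	}
	if len(p) > f.limit {
		p = p[:f.limit]
	}
	n, err := f.r.Read(p)
	f.limit -= n
	return n, err
}

// fetch reads data through a retryReader whose first stream fails after
// failAfter bytes; later streams are healthy.
func fetch(data string, failAfter int) (string, error) {
	opens := 0
	rr := &retryReader{maxRetries: 3, open: func(off int64) (io.Reader, error) {
		opens++
		src := strings.NewReader(data[off:])
		if opens == 1 {
			return &flaky{r: src, limit: failAfter}, nil
		}
		return src, nil
	}}
	b, err := io.ReadAll(rr)
	return string(b), err
}

func main() {
	out, err := fetch("0123456789", 4)
	fmt.Println(out, err) // 0123456789 <nil>
}
```

Tracking the offset is what lets the reader resume instead of restarting, so the consumer sees one uninterrupted byte stream as long as a retry eventually succeeds.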

