Issues with MongoDB replication over an Azure VNet-to-VNet VPN

Has anyone had issues with replication over an Azure VNet-to-VNet VPN?

I have a replica set deployed to Azure. Each VM is a DSv2 with SSD storage, running Ubuntu 14.04 and MongoDB 3.2.11. Two servers (server 1 & 2) are running in the North Central US data center and the third (server 3) is running in the Central US data center. I’ve set up a VNet-to-VNet VPN to connect the two.

This worked great for about two weeks, then I started seeing an increasing number of “9001 socket exception [SEND_ERROR] server” errors in the logs. It started off as a few socket errors every couple of hours, but now I’m getting 30 to 40 socket errors every 10 minutes. Server 3 is basically not replicating because of all these errors and is very far behind.
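One way to put a number on "very far behind" is to compare each secondary's `optimeDate` with the primary's, both of which appear in the `rs.status()` output. The following is only a minimal sketch: the function name, the host names and the timestamps are made up for illustration, and the dictionary is a hand-written stand-in for the real status document rather than output from these servers.

```python
from datetime import datetime


def replication_lag_seconds(status):
    """Return {member name: seconds behind the primary} for each secondary.

    `status` is expected to look like the document returned by rs.status()
    (the replSetGetStatus command), with a "members" list whose entries
    carry "name", "stateStr" and "optimeDate".
    """
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hypothetical status document; names and times are invented for the example.
sample_status = {
    "members": [
        {"name": "server1:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2017, 1, 1, 11, 59, 59)},
        {"name": "server2:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2017, 1, 1, 12, 0, 0)},
        {"name": "server3:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2017, 1, 1, 11, 15, 0)},
    ]
}

lag = replication_lag_seconds(sample_status)
for name, seconds in sorted(lag.items()):
    print(name, seconds)
```

Against a live replica set, the same function could be fed the result of running the `replSetGetStatus` command through a driver instead of the sample dictionary.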

I’ve checked the ulimit settings. All servers have keepalive set to 120 seconds, and THP is disabled. Pings are quick, around 17ms. I do not have any capped collections.

Yesterday, I stopped server 3 and removed all of its data files to force a resync. The server synced in about 35 minutes, but then went right back to the same issue. I did not get a single socket exception error during the resync.

Any ideas?

Thanks in advance!

Matt

Update:

I decided to look into date/time on all 3 servers to make sure they were in sync.
Servers 1 and 3 had the correct time. Server 2 (the MongoDB primary) was about 2 minutes behind, so I updated ntp to the latest version on all servers and restarted the ntp service. At that point, MongoDB failed over and server 1 became the primary, but I still had the issue on server 3. I checked rs.status and noticed that server 3 was still syncing from server 2, which is now a secondary. I changed it to sync from server 1 and boom, issue resolved.
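The rs.status check described above can be sketched in a few lines of Python. This is a rough illustration, not the exact procedure used here: the function name and host names are hypothetical, and the dictionary is a hand-written stand-in for the real status document. MongoDB 3.2 reports a member's sync source in the `syncingTo` field; newer releases use `syncSourceHost`, so the sketch checks both. Syncing from a secondary (chained replication) is allowed by default, so a hit is something to look at rather than an error in itself.

```python
def members_not_syncing_from_primary(status):
    """Return (member, sync source) pairs where the source is not the primary.

    `status` is expected to look like the document returned by rs.status()
    (the replSetGetStatus command).
    """
    members = status.get("members", [])
    primary = next(
        (m["name"] for m in members if m.get("stateStr") == "PRIMARY"), None
    )
    chained = []
    for m in members:
        # "syncingTo" on MongoDB 3.2, "syncSourceHost" on newer releases.
        source = m.get("syncingTo") or m.get("syncSourceHost")
        if source and source != primary:
            chained.append((m["name"], source))
    return chained


# Hypothetical status after the failover: server 1 is primary, but
# server 3 is still pulling from server 2.
sample_status = {
    "members": [
        {"name": "server1:27017", "stateStr": "PRIMARY"},
        {"name": "server2:27017", "stateStr": "SECONDARY",
         "syncingTo": "server1:27017"},
        {"name": "server3:27017", "stateStr": "SECONDARY",
         "syncingTo": "server2:27017"},
    ]
}

chained = members_not_syncing_from_primary(sample_status)
print(chained)
```

Once a member like this is found, its sync source can be changed from the mongo shell on that member with `rs.syncFrom("host:port")`, which wraps the `replSetSyncFrom` admin command.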

This is still a bit odd for a few reasons.
1. Server 1 and server 2 never had sync issues, even with the time mismatch. I’m not sure if the time really matters all that much.
2. If it’s an issue on server 2, why wouldn’t server 1 have had the same issue as server 3? It could be that the increased network latency on the VPN causes this issue to snowball, but I’m not really sure.

If anyone has any ideas, please let me know.
