In continuing my on-going issues with SQL Server replication the past couple of weeks, we had a data center outage on Monday, which resulted in a forced failover to our DR data center. Our MySQL database applications successfully failed over and required no followup. We did run into 1 peer-to-peer row conflict, which I was able to address, similar to the issue late last week I documented here.
Things looked to be running along fine, and I could see that my DR database was up to date with my production database. My monitoring system was showing some undistributed transactions, but nothing that seemed too outrageous. But I was continuing to get a few Network error notifications, which lead me to look at the SQL Server Replication Monitor. Everything was green, but my undistributed transactions were showing well over an hour.
Recovering from a Network Outage
In the case of an extended network outage between primary & secondary (DR in my example), a few steps may be taken to address the situation.
Example Outage notification:
Title: SQL Server Alert System: 'Replication: agent retry' occurred on \\PROD Body: DATE/TIME: 6/30/2014 5:41:55 PM DESCRIPTION: Replication-Replication Distribution Subsystem: agent PUBLICATIONAGENT scheduled for retry. The process could not connect to Subscriber 'DR'. COMMENT: (None) JOB RUN: (None)
|1||Confirm network is back online||Check the SQL Server Event logs that the network messages have stopped.|
|2||Once replication has reestablished, you may have encountered a peer-to-peer conflict||Review steps under
SQL Server Peer-to-Peer Replication – When a Conflict is Detected
|3||Connect to the replication monitor from one of the 2 machines (DR recommended)Make sure replication has been reestablished||You may need to reconnect to the other distributor in the monitor, right click on the monitor, select “Connect to Distributor”, and enter any security details neededA happy environment:|
|4||Check that Prod & DR seem to be in sync||If you are familiar with your environment, check some of the top tables to see that the secondary is replicating as expected
|5||Check for a large number of undistributed transactions||There seems to be an issue with the Replication Agents Cleanup job where an extended delay (10+ min) will cause the job to silently fail.My Original reference: http://blogs.msdn.com/b/sqlserverfaq/archive/2010/09/17/sometimes-replication-monitor-shows-number-of-undistributed-transactions-as-a-very-high-number-but-replication-itself-works-properly.aspx
|6||Restart Agents||The first step for resolving the high number of Undistributed transactions was to restart the agents; this was fortunately the resolution in my case.Restart the agents – to be safe, this should be done after any network outage