SQL Server Replication – Recovering from a Network Outage

In continuing my on-going issues with SQL Server replication the past couple of weeks, we had a data center outage on Monday, which resulted in a forced failover to our DR data center.  Our MySQL database applications successfully failed over and required no followup.  We did run into 1 peer-to-peer row conflict, which I was able to address, similar to the issue late last week I documented here.

Things looked to be  running along fine, and I could see that my DR database was up to date with my production database.  My monitoring system was showing some undistributed transactions, but nothing that seemed too outrageous.  But I was continuing to get a few Network error notifications, which lead me to look at the SQL Server Replication Monitor.  Everything was green, but my undistributed transactions were showing well over an hour.

Recovering from a Network Outage

In the case of an extended network outage between primary & secondary (DR in my example), a few steps may be taken to address the situation.

Example Outage notification:

Title: SQL Server Alert System: 'Replication: agent retry' occurred on \\PROD

Body:
DATE/TIME:    6/30/2014 5:41:55 PM
DESCRIPTION:    Replication-Replication Distribution Subsystem: agent PUBLICATIONAGENT scheduled for retry. The process could not connect to Subscriber 'DR'.
COMMENT:    (None)
JOB RUN:    (None)

Resolution steps:

Step
Detail
1 Confirm network is back online Check the SQL Server Event logs that the network messages have stopped.
2 Once replication has reestablished, you may have encountered a peer-to-peer conflict Review steps under

SQL Server Peer-to-Peer Replication – When a Conflict is Detected

3 Connect to the replication monitor from one of the 2 machines (DR recommended)Make sure replication has been reestablished You may need to reconnect to the other distributor in the monitor, right click on the monitor, select “Connect to Distributor”, and enter any security details neededA happy environment:ReplicationMonitor-RunningTransactions-blocked 
4 Check that Prod & DR seem to be in sync If you are familiar with your environment, check some of the top tables to see that the secondary is replicating as expected

select top 5 editeddate, id from table order by editeddate desc;
5 Check for a large number of undistributed transactions There seems to be an issue with the Replication Agents Cleanup job where an extended delay (10+ min) will cause the job to silently fail.My Original reference: http://blogs.msdn.com/b/sqlserverfaq/archive/2010/09/17/sometimes-replication-monitor-shows-number-of-undistributed-transactions-as-a-very-high-number-but-replication-itself-works-properly.aspx

  1. On the subscription, under the all subscription tab, right click & choose “view details”.  Click on the Undistributed Commands tab.
    1. If the estimated time is less than a few minutes, and the number of commands is under a thousand or two, you’re OK.
      1. Sample of up to date(ish)
      2. ReplicationMonitor-UndistrinutedTransactions-blocked
  2. Run SQL to check for Transactions
    1. Part 1: Enter you <DATABASENAME> & <PUBLICATIONNAME>
      Declare
      @PublisherDB sysname,
      @Publication sysname,
      @agentid sysname;
       
      -- Set Publisher database name
      Set @PublisherDB = N'<DATABASENAME>';
      -- Set Publication Name
      Set @Publication = N'<PUBLICATIONNAME>';
      -- Get agent id
      Set @agentid = (select id from MSdistribution_agents where publisher_db =@PublisherDB and subscriber_db = @PublisherDB and Publication = @Publication)
      select @agentid
      --Get xact_seqno & delivery_rate
      select top 1 xact_seqno, delivery_rate from dbo.MSdistribution_history
      where agent_id = @agentid
      and runstatus in (2,3,4) order by time desc;

       

      1. Sample Output:
        ---Returned:
        xact_seqno    delivery_rate
        0x0003B66300019C4F0001000000000000    16.2109375
    2. Part 2:
      1. Note: replace <DELIVERY_RATE> and <XACT_SEQNO> with numbers from above, and enter you <DATABASENAME> & <PUBLICATIONNAME>
        -- PART 2:
        Declare
        @PublisherDB sysname,
        @Publication sysname,
        @agentid sysname;
         
        -- Set Publisher database name
        Set @PublisherDB = N'<DATABASENAME>';
        -- Set Publication Name
        Set @Publication = N'<PUBLICATIONNAME>';
        -- Get agent id
        Set @agentid = (select id from MSdistribution_agents where publisher_db =@PublisherDB and subscriber_db = @PublisherDB and Publication = @Publication)
        DECLARE @avg_rate int
        DECLARE @retcode int
        DECLARE @mstestcountab TABLE ( pendingcmdcount int )
        select @avg_rate = isnull(avg(<DELIVERY_RATE>),0.0)
                from dbo.MSdistribution_history
                where agent_id =  @agentid
         
            insert into @mstestcountab (pendingcmdcount)
                exec @retcode = sys.sp_MSget_repl_commands
                                            @agent_id = @agentid
                                            ,@last_xact_seqno = <XACT_SEQNO>
                                            ,@get_count = 2
                                            ,@compatibility_level = 9000000
         
        select pendingcmdcount from @mstestcountab
         
            select
                pendingcmdcount
                ,N'estimatedprocesstime' = case when (@avg_rate != 0.0)
                                        then cast((cast(pendingcmdcount as float) / @avg_rate) as int)
                                        else pendingcmdcount end
            from @mstestcountab
6 Restart Agents The first step for resolving the high number of Undistributed transactions was to restart the agents; this was fortunately the resolution in my case.Restart the agents – to be safe, this should be done after any network outage

  1. Under the Publication –> Agents Tab, right click & choose Stop Agent.
    1. Refresh to check that it has stopped, then right click & choose Start Agent.
  2. Under the Publication –> All Subscriptions tab, right click & choose Stop Synchronizing
    1. Refresh to check that it has stopped, then right click & choose Start Synchronizing
  3. Repeat 1 & 2 for the other node in the replication cluster
  4. Recheck the Undistributed Commands tab for each cluster.  It make take a minute or 2 for it to catchup & clear out the old transactions.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s