SQL Server "Distribution clean up: distribution" Job Failing Large MSRepl_commands
Published: 3rd May 2013
This was something I came across when working in a heavily replicated environment… the replication created job “Distribution clean up: distribution” suddenly began to fail constantly. This led me on a trail which had such drastic results that I felt compelled to document it.

I had a look around on the server in which this job was failing and found that the MSRepl_Commands table was HUGE!!! (This is the table which holds all commands to be replicated).

Here’s the situation I was seeing…

As you can see, this did not look right at all. This table was actually over 100 million rows when the errors were occurring… I managed to manually reduce it to this level just so that the cleanup job would work and before I started recording my findings.

Anyway, this is the job that was failing…

This was also taking a long time to run as it was running every 10 mins and taking between 6 and 15 mins each time it ran. This was causing a lot of problems, for example blocking, even before it started to fail.

The job runs the following code:

EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_distretention = 72

The purpose of this code is to clean up any commands in the MSRepl_Commands table which are more than 72 hours old. So, I simply ran this proc manually starting at retention time of 200 hours and working down to 72 in order to reduce the table size enough for the job to stop failing. This bought me some time at least.

However with MSRepl_commands being SO big I was aware that it was slowing down the server and could start to cause problems.

The reason this had been caused was the following:

When setting up a transactional replication using the wizard we get to the following screen…

This is where it all goes wrong. The way this company had always created replication sets is to tell the publisher to create the snapshot right away (people choose this it’s much easier than having to create the publication and then start the Snapshot Agent manually), therefore they simply check the box that says “Create a snapshot immediately and keep the snapshot available to initialize subscriptions”.

However, this is not actually a very wise thing to do. What is actually happening is that this changes a setting in the publication which means that ALL transactions are held for 72 hours and only then deleted from the MSRepl_commands table by the cleanup job. You can see this by using this query and looking at the “immediate_sync” column…

If the immediate_sync column is set to 1 then all commands are being held for 72 hours. However, if it is set to 0 then commands are removed from MSRepl_commands the moment that they have been passed to all subscribers… not held for the full 72 hours. In this case most replication runs every couple of minutes and therefore commands don’t need to be held for more than roughly 3 mins before they can be deleted.

Therefore I ran the following code to set the flag to 0…

-- Check to see if "immediate sync" is set to 1.
--  This will cause transactions to be held in MSRepl_commands 
-- rather than deleted when distributed.

exec dbo.sp_helppublication @publication = '' -- put your publication name here

-- How to alter the setting to 0.
EXEC sp_changepublication
    @publication = '', -- put your publication name here
    @property = 'allow_anonymous',
    @value = 'false'
GO

EXEC sp_changepublication
    @publication = '', -- put your publication name here
    @property = 'immediate_sync',
    @value = 'false'
GO


Once the setting was put to 0, I ran the cleanup job again and it made a HUGE difference. Instead of millions of rows, it dropped to virtually nothing…

Before:

After:

The cleanup job now runs in about 10 seconds rather than 6-13 mins, a massive improvement.
Comments:
NB: Comments will only appear once they have been moderated.