Merge Operator
Published: 24th August 2013
Continuing with the posts on Join Operators I’m going to move on from the Nested Loop join and give a very brief explanation of the MERGE operator.

To be honest this is one of the join types I actually see the least out of all three. I’m not sure why this is, but that’s my experience. Doesn’t mean it’s a bad join type though as it’s very, very fast when it does appear. But it does have a couple of pre-requisites before SQL will consider it a viable option.

Again, this isn’t going to be a massive Microsoft White Paper standard post (mainly because I’m nowhere near qualified to write one), but a simple overview just to get you understanding the basics as to why the optimiser will sometimes choose this over the alternatives and what it can mean to performance.

For another VERY in depth and, frankly, brilliant blog as to the in depth internals of this operator, see Craig Freedman’s blog here.

For the simple version, read on.

So, what is a Merge join?

Unlike the intuitive “Nested Loop”, I don’t particularly think of this one as a Merge but in my warped head I see it as a kind of zip. Yes, I’m am that odd. I’ll try and explain using a picture as I find this easiest…

Imagine we have tableA and tableB and we’re trying to join on tableA.colA and tableB.colB.

Hoping that makes sense, this is why I think of it as a zip as it seems to move from left to right (in my picture), interlocking as it goes. Yes… my head’s an odd place. But hopefully my explanation helped anyway.

Here’s what one looks like in an execution plan:

So… when would we want to see one of these in our execution plans? These are usually found when there are two medium sized tables, roughly equal in size, and when the join conditions are ordered. As my (poor) explanation above implies… the columns have to be in data order for the Merge join to function efficiently.

Sometimes a Merge will enforce an extra Sort in your execution plan if one is not present on the column and the optimiser still wishes to use the Merge join.

Let’s create an example using AdventureWorks:

select h.customerID
from sales.SalesOrderHeader h
inner loop join sales.Customer c
on h.CustomerID = c.CustomerID
where h.CustomerID >= 30100
go

select h.customerID
from sales.SalesOrderHeader h
inner merge join sales.Customer c
on h.CustomerID = c.CustomerID
where h.CustomerID >= 30100
go

select h.customerID
from sales.SalesOrderHeader h
inner hash join sales.Customer c
on h.CustomerID = c.CustomerID
where h.CustomerID >= 30100
go


As you can see… for this particular type of query, the merge join is actually the joint best performer in terms of reads and cpu… however the Merge join, like the Nested Loop, does not require memory in order to perform and therefore it is significantly less expensive than the Hash join.

So why don’t I see this much? My theory about this is because of the criteria required for the Merge join to take place… that being the requirement for both columns to be sorted prior to the Merge taking place. If there is no index then the optimiser would enforce a sort prior to Merging the data. This adds overhead and a memory requirement to the execution plan and therefore can tip the cost in favour of another join type.

Using this query will show you that the optimiser has had to include a Sort into the execution plan because there were no appropriate indexes it could use to obtain pre-sorted data…

select h.customerID, d.ProductID, d.UnitPrice
from sales.SalesOrderHeader h
inner merge join sales.SalesOrderDetail d
on h.SalesOrderID = d.SalesOrderID
where h.CustomerID >= 29100
go


Also, it’s not hard to increase a data volume to the point that a Hash becomes far more efficient.

However, when a Merge is chosen, it is very fast and therefore if indexes can be created to help force a Merge then your queries could speed up immensely.
Comments:
NB: Comments will only appear once they have been moderated.