Different Query Results with row_number()
Published: 15th July 2013
Today I was asked a question that left me baffled for a good few minutes whilst trying to come up with an answer. Sadly my mistake all along was that I trusted a developer. Ah well, happens to everyone sometimes.

Basically the issue was that QA had noticed that there were inconsistencies with data between our staging and QA servers. Obviously this shouldn’t be the case and the developers tracked it down to a specific stored procedure… one server returned one set of results, the other server a different set. Therefore the issue ended at my doorstep.

Initially it was a simple case of running the proc on a variety of test servers… initially all gave the same result and I was left completely baffled as to the problem. But then, when re-running the query, suddenly my results changed.

The problem was that the query was incredibly complex involving numerous base tables, temp tables, subqueries, and a large merge statement at the end. So I went for the obvious… base tables.

Turns out they weren’t to blame. All servers were identical, so what next? I looked at the merge… ran the query to that point, stripped it out, and ran it as a standalone. Turns out it was acting correctly and consistently (which I’ve had issues with before).

I tried a test box which I could clear the proc cache and buffers and ran it there… I looked at execution plans to see if anything was awry that could possibly cause something to malfunction. It was clearly “clutch at straws” time.

Well, after doubting my data, doubting the servers, and doubting my sanity… there it was… and it brings me back to the start… never trust developers over your databases.

Obviously I can’t post the exact code, so I’ve made an example… but this is the equivalent to what I found buried deep inside the code:

select firstName + ' ' + lastName as fullName
from
(
  
select row_number() over(order by lastName, streetName) rowID,
          
firstName, lastName
  
from testPeople p
  
join testAddress a
  
on p.addressID = a.id
) a
where rowID = 1


Now, of course this LOOKS harmless and clearly should return results. However, the problem is that it’s effectively creating a row number over something that is NOT unique and therefore SQL Server is free to pick and choose which record it wishes to return as rowID = 1. Therefore, in this case, a server could easily return different values for firstName and still be correct.

This is exactly what was happening in this situation. Luckily QA are pretty sharp but this is something that could easily have been missed if both servers had happened to churn out the same result.

Here’s a demo in full just to demonstrate:

if object_id('testPeople') is not null drop table testPeople
if object_id('testAddress') is not null drop table testAddress
go

create table testAddress
(
  
id tinyint identity(1, 1),
  
street varchar(50),
  
country varchar(15),
  
constraint pk_testAddress primary key clustered(id)
)
insert into testAddress
values('742 Evergreen Terrace', 'USA'), ('31 Spooner Street', 'USA')

create table testPeople
(
  
id int identity(1, 1),
  
firstName varchar(15),
  
lastName varchar(15),
  
addressID tinyint,
  
constraint pk_testPeople primary key clustered(id)
)
insert into testPeople
values('Lisa', 'Simpson', 1), ('Marge', 'Simpson', 1), ('Bart', 'Simpson', 1),
       (
'Peter', 'Griffin', 2), ('Lois', 'Griffin', 2), ('Stewie', 'Griffin', 2)

select firstName + ' ' + lastName as fullName, street
from
(
  
select row_number() over(partition by lastName order by lastName, street) rowID,
          
firstName, lastName, street
  
from testPeople p
  
join testAddress a
  
on p.addressID = a.id
) a
where rowID = 1


My test machine gave the results of this query as:

So now let’s remove Lisa and replace her with Homer:

delete from testPeople where id = 1
insert into testPeople select 'Homer', 'Simpson', 1

select firstName + ' ' + lastName as fullName, street
from
(
  
select row_number() over(partition by lastName order by lastName, street) rowID,
          
firstName, lastName, street
  
from testPeople p
  
join testAddress a
  
on p.addressID = a.id
) a
where rowID = 1


This results in:

As you can see, this doesn’t really make sense as you would think that Homer was inserted last and therefore would not be row_number 1, but all we can tell is that these results are not ordered by primary key as default. Neither are they ordered alphabetically by firstName, for example. As a normal user you’re never likely to know what result could appear and therefore you NEED to make your row_number unique such as, in this case, adding the person ID to the order by.

Anyway, not sure if anyone will find this useful, or come across something similar, but just in case you do see this pattern, at least you’ll think to look for a row_number as a starting point.
Comments:
NB: Comments will only appear once they have been moderated.