Be Very Careful with Sampled Statistics
Published: 11th March 2015
Okay, we all know that sampled statistics are not perfect; otherwise there would be no need for a full scan stats update. However, I recently encountered an issue with them that has really bothered me.

When we perform a sampled stats refresh we know that SQL Server will sample a smaller and smaller percentage of the rows in our table as it grows (the sample rate is based on data size, not pure row count), so we're all accustomed to living with that. But there's another flaw that really drives me mad…

During some of the work I've been doing recently we are constantly updating statistics in order to get good plans (which is pretty standard) because we're inserting a lot of ascending key data and, in versions prior to SQL Server 2014, this can cause a lot of problems, with SQL Server estimating that we will only obtain 1 record and therefore choosing nested loops when we need hash joins.

This can be alleviated using Trace Flags (I’ll cover those separately) but it’s still not perfect if you ever do anything which causes SQL Server to think the keys may not actually be ascending.
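For reference (the details are a topic for another post), the trace flags in question are 2389 and 2390, which let the optimizer generate quick stats for keys above the histogram's top step when a column is branded ascending (or unknown). A rough sketch of enabling them per query, against a hypothetical orders table:

```sql
-- Trace flags for the ascending key problem (pre-2014 optimizer):
--   2389: quick stats above the histogram for columns branded "ascending"
--   2390: the same, even for columns branded "unknown"
-- dbo.orders and its columns are hypothetical, for illustration only.
select o.orderID, o.orderDate
from dbo.orders o
where o.orderDate > '2015-03-01'
option (querytraceon 2389, querytraceon 2390)
```

Note that QUERYTRACEON requires sysadmin rights, which is one of the reasons these flags are fiddly to roll out per query.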

So anyway… “You’re updating stats so what’s the problem?”, I hear you ask.

Well, these tables are upwards of 150GB in size in some cases and therefore we’re relying on sampled stats because a full scan simply isn’t feasible.

BUT, as I found out, SQL Server doesn't seem to think it's necessary to bookend the sample, i.e. it doesn't always sample the min and max values in the table.

To me this seems pretty fundamental as you would think it would simply take the first, the last, and then whatever sample it wanted.

When dealing with ascending keys it means that nearly all queries being run want the most recent key values, and SQL Server simply doesn't pick them up. Therefore I found that even running an update stats wouldn't help my execution plans, and in the end I had to resort to filtered stats (which can be tricky if you don't have something nice like a date column to work with) alongside query hints.

These are all viable but fiddly, and wouldn't be needed if sampled stats didn't have what seems to me a massive flaw.
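As a sketch of the filtered-stats workaround (the table name and key threshold here are hypothetical): create a statistics object covering only the recent end of the key range, and keep it fresh with a full scan, which stays cheap because the filter restricts it to a small slice of the table:

```sql
-- Hypothetical example: filtered statistics over just the recent keys.
-- The filter keeps the full scan cheap, so the top of the range stays accurate.
create statistics st_orders_recent
on dbo.orders (orderID)
where orderID > 900000
with fullscan;

-- Refresh as new data arrives:
update statistics dbo.orders st_orders_recent with fullscan;
```

The catch, as mentioned above, is that the filter predicate has to move with the data, so without a convenient column such as a date you end up dropping and recreating these objects.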

If you want to have a look yourself, run the following code and examine the output:

use testDB
go

if object_id('myTestStatsTable') is not null drop table myTestStatsTable
go

create table myTestStatsTable
(
    id int identity,
    idINT int,
    myValue char(1000) default('a'),
    constraint pk_myTestStatsTable primary key clustered(id)
)
go

-- holds the output of dbcc show_statistics ... with stat_header
declare @statHeader table
(
    keyName varchar(50),
    updateTime varchar(50),
    noRows int,
    sampledRows int,
    steps tinyint,
    density float,
    avgKeyLength float,
    stringIndex varchar(3),
    filterExp varchar(10),
    unfilteredRows int
)

-- holds the output of dbcc show_statistics ... with histogram
declare @statsHist table
(
    rangeHiKey int,
    rangeRows float,
    eqRows float,
    distinctRangeRows float,
    avgRangeRows float
)

declare @output table
(
    id int identity,
    noRows int,
    sampledRows int,
    maxStatsID int
)

declare @rowLimit int = 1000000, @sql varchar(max)

-- insert 10,000 rows at a time; every 50,000 rows, update stats and
-- record the highest key present in the histogram
while isnull((select max(id) from myTestStatsTable), 0) <= @rowLimit
begin
    insert into myTestStatsTable(idINT)
    select top (10000) 1
    from sys.objects o
    cross join sys.columns c
    cross join sys.allocation_units a

    if (select max(id) from myTestStatsTable) % 50000 = 0
    begin
        update statistics myTestStatsTable

        select @sql = 'dbcc show_statistics (''myTestStatsTable'', ''pk_myTestStatsTable'') with stat_header'

        insert into @statHeader
        exec (@sql)

        select @sql = 'dbcc show_statistics (''myTestStatsTable'', ''pk_myTestStatsTable'') with histogram'

        insert into @statsHist
        exec (@sql)

        insert into @output
        select s.noRows, s.sampledRows, h.highValue
        from @statHeader s
        cross apply
        (
            select max(rangeHiKey) highValue
            from @statsHist
        ) h

        delete from @statHeader
        delete from @statsHist
    end
end

-- maxStatsID is the top RANGE_HI_KEY in the histogram; compare it
-- against the real max(id) at each checkpoint
select noRows, sampledRows,
       (sampledRows / (noRows * 1.)) * 100 as percentSampled,
       maxStatsID
from @output
order by id
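In the output you should see maxStatsID trailing the true max(id) after the sampled refreshes. For comparison (this is not part of the repro above), forcing a full scan on just that one statistics object brings the histogram's top step back up to the real maximum, and scanning a single stat is often feasible even when a full scan of every statistic on the table is not:

```sql
-- Full scan of the single statistics object behind the clustered PK.
update statistics myTestStatsTable pk_myTestStatsTable with fullscan;

-- The last RANGE_HI_KEY should now match max(id):
dbcc show_statistics ('myTestStatsTable', 'pk_myTestStatsTable') with histogram;
```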
