Part-Populating Temp Tables
Published: 12th March 2014
The other day I was going through some code looking to reduce reads (they were rather higher than expected) and, whilst doing so, I came across a piece of code that seems logical yet is actually quite a performance killer.

Basically it involves the practice of creating a temporary table (or permanent table) and only populating the columns in stages as the data becomes available. I’ll explain this with an example.

Now, this example is complete nonsense; it simply exists to serve my explanation. In the real code I could understand what they had done and why, it was just less than optimal. This example shows why it was less than optimal by using an extreme scenario.

Basically imagine we create the following temporary table to store some data:

use AdventureWorks2012
go

if object_id('tempDB..#orders') is not null drop table #orders
go

create table #orders
(
  orderID int,
  customerID int,
  totalDue money,
  miscCustInfo varchar(7000),
  constraint pk_orders primary key clustered(orderID)
)


Now what we’re going to do is populate our table in sections. Firstly we’ll insert the sales data and then, once that’s in place, we’ll use an update like this to generate our miscCustInfo:

update #orders
set miscCustInfo =
  case when customerID < 20000 then convert(char(7000), 'a')
       when customerID >= 20000 then convert(char(7000), 'z')
  end


As I mentioned, this is very much exaggerated, as there was a legitimate reason for them to populate in sections, but the principle was the same… the reads were horrific. But why?

Well this all comes back to the 8k page again.

When you create this table with a large nullable column but don’t populate it, SQL Server doesn’t reserve any space for that column on the data page. Therefore, in the above, with just two ints and a money column being populated, SQL Server can fit a few hundred records on each 8KB page.
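If you want to see this for yourself, a quick sketch like the one below (using sys.dm_db_index_physical_stats against the temp table in tempDB) will show the page count and average record size for the clustered index; run it straight after the insert and again after the update and compare the two. The exact figures will vary with your data and SQL Server version.

-- leaf level of the clustered index on #orders, before and after the update
select page_count, record_count, avg_record_size_in_bytes, avg_page_space_used_in_percent
from sys.dm_db_index_physical_stats(db_id('tempDB'), object_id('tempDB..#orders'), null, null, 'detailed')
where index_level = 0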

However, as soon as you then populate that large field to its capacity (in this case the 7000-byte miscCustInfo), SQL Server suddenly has to fit that data onto the page alongside the rest of each record. Only one 7000+ byte row fits on an 8KB page, so a huge number of page splits occur as SQL Server shuffles the existing rows around to make room.
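If you’re curious about the splits themselves, one rough (and entirely undocumented) way to see them is to count the LOP_DELETE_SPLIT records in tempDB’s log just before and just after the update. fn_dblog is undocumented and tempDB’s log is recycled aggressively, so treat this purely as a diagnostic sketch:

use tempDB
go

-- each LOP_DELETE_SPLIT log record marks a page split (undocumented, sketch only)
select count(*) as page_splits_so_far
from sys.fn_dblog(null, null)
where Operation = 'LOP_DELETE_SPLIT'
go

use AdventureWorks2012
go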

Likewise your transaction log gets battered. All in all this is very bad for performance.
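To put a number on that, a sketch along these lines (wrapping the update in an explicit transaction and checking sys.dm_tran_database_transactions before committing) will report how many log records and bytes your session generated; the figures will obviously depend on your data:

begin tran

update #orders
set miscCustInfo =
  case when customerID < 20000 then convert(char(7000), 'a')
       when customerID >= 20000 then convert(char(7000), 'z')
  end

-- log activity generated by this session's open transaction (tempDB for a temp table)
select t.database_transaction_log_record_count, t.database_transaction_log_bytes_used
from sys.dm_tran_database_transactions t
join sys.dm_tran_session_transactions s on s.transaction_id = t.transaction_id
where s.session_id = @@spid
and t.database_id = db_id('tempDB')

commit
go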

If you really do have to use this column-by-column methodology, I would recommend populating the large column up front with a dummy value and then simply overwriting it in your update. This means the column is already allocated its space on each data page and therefore no page splits need to occur.

Here are the two methods broken down… firstly the poor approach:

use AdventureWorks2012
go

if object_id('tempDB..#orders') is not null drop table #orders
go

create table #orders
(
  orderID int,
  customerID int,
  totalDue money,
  miscCustInfo varchar(7000),
  constraint pk_orders primary key clustered(orderID)
)

insert into #orders
(
  orderID, customerID, totalDue
)
select h.salesOrderID, h.CustomerID, h.TotalDue
from sales.SalesOrderHeader h
where h.rowguid like '7%'

update #orders
set miscCustInfo =
  case when customerID < 20000 then convert(char(7000), 'a')
       when customerID >= 20000 then convert(char(7000), 'z')
  end
go


Now a better approach:

use AdventureWorks2012
go

if object_id('tempDB..#orders') is not null drop table #orders
go

create table #orders
(
  orderID int,
  customerID int,
  totalDue money,
  miscCustInfo varchar(7000),
  constraint pk_orders primary key clustered(orderID)
)

insert into #orders
(
  orderID, customerID, totalDue, miscCustInfo
)
select h.salesOrderID, h.CustomerID, h.TotalDue, replicate('.', 7000)
from sales.SalesOrderHeader h
where h.rowguid like '7%'

update #orders
set miscCustInfo =
  case when customerID < 20000 then convert(char(7000), 'a')
       when customerID >= 20000 then convert(char(7000), 'z')
  end
go


As you can see, this is much lighter on reads, and this was only a very small table. At a larger scale the first approach could be catastrophic to your query performance. That’s before you even consider the impact of each page split on the transaction log and how much extra log data would be generated, all of which needs to be written out to disk.
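If you want to reproduce the comparison yourself, the simplest way is to wrap each script with SET STATISTICS IO ON and compare the logical reads reported for the update against #orders; a sketch:

set statistics io on
go

-- run the full "poor" script here, then the "better" script in a fresh window,
-- and compare the logical reads that statistics io reports for the update on #orders

set statistics io off
go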

Basically, please use dummy values to ensure your records are sized correctly from the start and avoid unnecessary overhead and poor performance down the line.