blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

SSIS Junkie

T-SQL: A T-SQL Poser - Part 3

A couple of weeks ago I posed a T-SQL problem here and later some possible solutions here. You should probably read those posts before you read this. In a nutshell I wanted to concatenate values from multiple rows into a comma-delimited list. In the second post a I promised a performance comparison of the two main solutions that were put forward and that's what this post is for.

Here's the first solution:

WITH rownumcte1 AS
(
        SELECT  ROW_NUMBER() OVER (PARTITION BY id ORDER BY id,name) AS rownum,id,name
        FROM    t1

)
,rownumcte2 AS
(
        SELECT  rownum,id,name
        FROM    rownumcte1
        WHERE   rownum = 1
        UNION   ALL
        SELECT  r1.rownum,r1.id,r2.NAME + ',' + r1.name
        FROM    rownumcte1 r1
        INNER   JOIN rownumcte2 r2
        ON      r1.id = r2.id
        AND     r1.rownum = r2.rownum + 1
)
SELECT  r2.id,r2.NAME
FROM    (
        SELECT  id,MAX(rownum)maxrownum
        FROM    rownumcte1
        GROUP   BY id
) r1
INNER   JOIN rownumcte2 r2
ON      r1.id = r2.id
AND     r1.maxrownum = r2.rownum

and here's the second:

SELECT id, STUFF( ( SELECT ','+ name FROM t1 a WHERE b.id = a.id FOR XML PATH('')),1 ,1, '')

FROM t1 b
GROUP BY id

 

In order to test them out I needed some test data which I produced using the following script:

USE tempdb
GO

if exists (select * from sys.tables where name = 't1')
drop table t1
go
create table t1 (id INT, NAME VARCHAR(MAX))
go

declare @m int, @n int
set @m = 100
set @n = 100

declare @outerloopcounter int, @innerloopcounter int
set @outerloopcounter = 1

while @outerloopcounter <= @n
begin
 set @innerloopcounter = 1
 while @innerloopcounter <= @m
 begin
  insert t1
  values (@innerloopcounter, CHAR(ROUND(97 + (RAND() * (25)),0)) )
  set @innerloopcounter = @innerloopcounter + 1
 end
 set @outerloopcounter = @outerloopcounter + 1
end

The script gives us a 10000 row dataset with 100 distinct values in the [ID] column and a random character in the [Name] column.

I was going to do a detailed analysis of the two solutions with multiple tests and all that sort of good stuff but on the first execution I realised there was no point. Solution 1 took 14m50s whereas solution 2 took less than a second. That's pretty conclusive on which one is quicker. I then wanted to know why.
 

If you break the query down it becomes fairly obvious. The second CTE produces 10000 rows; that's 100 rows for each of the 100 values in [ID]. The screenshot below shows what the second CTE returns:

 

You get the idea. Now take a look at what comes after the CTEs. The first CTE (10000) is joined with the second (10000 rows) and that join produces 100000000 rows which is then grouped. The GROUP BY implicitly causes a sort operation and taking a look at the execution plan tells me that the sort of those 100000000 rows costs 96% of our total query time.

The second solution that uses the FOR XML PATH('') construct results in a NESTED LOOP (i.e. inner join) operation over 100 rows and no SORT operations. When we break it down like this it is clear why the second solution is so much quicker.

 

I doubt you're still reading at this point because analysing queries isn't exactly a fun task. If you are though I urge you to take the code from above and try it out for yourself. Play with the values of @m and @n and see how the execution time differs for each solution. I have learnt alot out of this little episode and you may do too.

 

-Jamie

 

You get the idea. Now take a look at what comes after the CTEs. The first CTE (10000) is joined with the second (10000 rows) and that join produces 100000000 rows which is then grouped. The GROUP BY implicitly causes a sort operation and taking a look at the execution plan tells me that the sort of those 100000000 rows costs 96% of our total query time.

The second solution that uses the FOR XML PATH('') construct results in a NESTED LOOP (i.e. inner join) operation over 100 rows and no SORT operations. When we break it down like this it is clear why the second solution is so much quicker.

 

I doubt you're still reading at this point because analysing queries isn't exactly a fun task. If you are though I urge you to take the code from above and try it out for yourself. Play with the values of @m and @n and see how the execution time differs for each solution. I have learnt alot out of this little episode and you may do too.

 

-Jamie

P.S. I have no idea why the last part of this blog post appears twice. For some reason our obscenely pathetic blog engine puts it there twice - I've spent ages trying to correct it and have given up, exasperated.

Published 05 April 2007 07:17 by jamie.thomson
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Jason Haley said:

April 5, 2007 14:46

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

News

Powered by Community Server (Personal Edition), by Telligent Systems