blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

SSIS Junkie

I am currently on holiday until 13th October so comments are disabled until then. Feel free to use the "Email" link but don't expect an answer.

Conchango are busy and need top level Technical Architects for Microsoft & Open Source platforms in and around London. Interested? Email me or send me a message

T-SQL: A T-SQL Poser - Part 2

OK, a couple of days on and it seems my T-SQL poser created a little bit more interest than I expected - 7 comments and some follow-up blog posts testify to that (yes, 7 is alot for me). If you're browsing here first then go and read the previous post to understand the background here.

I promised I would post my solution and Saivendra's, and so I will. I'll also comment on some of the solutions that other people posted. Just before I do though, one thing I should have said in the previous post was that the order of the rows in the resultset was not important. I never stated that the rows had to be returned in a certain order and nor did the list of items in the comma-delimited list. However, judging by most of the responses people assumed this was part of the requirement - it wasn't. The only reason I mention this because I don't want an ORDER BY clause to interfere with execution duration when I later do a performance comparison of the various methods.

So, onward with sharing the various solutions to this problem...

#1 - A couple of CTEs

This was my solution:

WITH rownumcte1 AS
(
        SELECT  ROW_NUMBER() OVER (PARTITION BY id ORDER BY id,name) AS rownum,id,name
        FROM    t1

)
,rownumcte2 AS
(
        SELECT  rownum,id,name
        FROM    rownumcte1
        WHERE   rownum = 1
        UNION   ALL
        SELECT  r1.rownum,r1.id,r2.NAME + ',' + r1.name
        FROM    rownumcte1 r1
        INNER   JOIN rownumcte2 r2
        ON      r1.id = r2.id
        AND     r1.rownum = r2.rownum + 1
)
SELECT  r2.id,r2.NAME
FROM    (
        SELECT  id,MAX(rownum)maxrownum
        FROM    rownumcte1
        GROUP   BY id
) r1
INNER   JOIN rownumcte2 r2
ON      r1.id = r2.id
AND     r1.maxrownum = r2.rownum

 

There's a couple of CTEs in there (one referencing the other, thus nested) plus the ROW_NUMBER() function. Both are new in SQL Server 2005 so this solution wouldn't have worked in SQL Server 2000. I won't go into the specifics of how it works -I'll let you do that for yourself if you're interested- but I'm hoping that this solution will scale pretty well given that it is wholly set-based.

Nick Barclay came up with a very similar solution that also used ROW_NUMBER() but made do with just one CTE - here is Nick's solution:

with ConcatNamesCTE (id, [name], rn)
as
(
select
    id, [name], rn
from
(   select   id, [name],
             row_number() over(partition by id order by id) as rn
    from t1
) a
where rn = 1
union all
    select   b.id, cn.[name] + ',' + b.[name], b.rn
    from
    (   select   id, [name],
                 row_number() over(partition by id order by id) as rn
         from t1
    ) b
    inner    join ConcatNamesCTE cn
    on       cn.id = b.id
    and      cn.rn + 1 = b.rn
)
select   d.id, d.[name]
from
(   select   max(rn) as rn, id
    from ConcatNamesCTE
    group    by id
) c
inner    join ConcatNamesCTE d
on       d.id = c.id
and      d.rn = c.rn

 

Obviously Nick is a fan of using derived tables Smile.

#2 - FOR XML

As I said before, Saivendra came up with a much more concise solution. Here it is:

SELECT id, STUFF( ( SELECT ','+ name FROM t1 a WHERE b.id = a.id FOR XML PATH('')),1 ,1, '')

FROM t1 b
GROUP BY id

Pretty neat huh? The STUFF function is used to drop the opening comma, that much is easy. The trick is in FOR XML PATH('') which, prior to this little episode, I never knew about so I have definately learnt something here. This solution (or variations thereof) was also put forward by David Portas, Jamie Hunter and Rob Farley.

One thing that immediately jumps out at me is that this solution contains a correlated subquery that will get executed for every row in the table. Hence, I have some reservations about performance over large datasets but I'm going to investigate that later (I'm hanging myself out to dry here because I'm saying that before doing any performance testing).

David also points out another drawback of this technique; some values won't get presented properly. Try inserting the following row into the table 

INSERT t1 values (3,'Contains a > delimiter')

Here's the output:

Not perfect I think you'll agree!

#3 - PATINDEX

A further solution was put forward by Adrian Downes. Here it is:

 

select res.id, max(res.name) as [name]
from
(
    select   c.id,
             case when PATINDEX('%' + d.name + '%', c.name) = 0 and c.id = d.id
                          then c.name + ', ' + d.name
                      else c.name
             end as [name]
    from
    (
         select   a.id, min(a.name) as [name]
         from
         (   select   y.id,
                 case when PATINDEX('%' + z.name + '%', y.name) = 0 and y.id = z.id
                      then y.name + ', ' + z.name
                 end as [name]
             from t1 y
             inner join t1 z
             on y.id = z.id
         ) a
         group by a.id
    ) c
    inner join t1 d
    on c.id = d.id
) res
group by
    res.id

Unfortunately Adrian, this one falls down a bit. It only works for the dataset that I supplied. For example I added the following extra row into the table:

INSERT t1 values (1,'Jim')

it no longer supplies the required result because it expects that you will have no more than 3 rows per ID value. Its not really the generic solution that I was looking for. My bad, I should have specified in the original post that it should work for all cases and any number of rows in the original table and I must apologise to Adrian for that who has dedicated some of his time to this. Thank you for the input though Adrian - its still valuable.

#4 - Recursive CTE

RickR has supplied another solution that leverages a CTE. On the surface this is actually my particular favourite because it solves the problem recursively even though there is no obvious way of recursing the table such as a self-referencing join. Its a very creative way of applying our T-SQL "bag of tricks". Here's the code

;WITH n1 AS (
    SELECT id, CONVERT(varchar(500), MIN(name)) name
    FROM t1
    GROUP BY id
UNION ALL
    SELECT a.id, CONVERT(varchar(500), a.name + ',' + b.name) name
    FROM t1 a
    JOIN n1 b
    ON a.id = b.id
    AND a.name > b.name
)
SELECT   id, max(name)
FROM n1
GROUP    BY id

As I just said, I really like RickR's solution however I upon digging into it further I have found some problems. Firstly, as the number of rows in the table per ID increases, the number of rows returned from the CTE increases exponentially. Even for the simple sample dataset that I posted containing 5 rows the CTE returns itself returns 6 rows which is why the GROUP BY is needed after the CTE. If I add the following row again:

INSERT t1 values (1,'Jack')

then the CTE returns 10 rows. Add another row with ID=1 (so there is now 7 rows in the table) and the CTE returns 18 rows. I think you can see the potential problem here - the performance of RickR's solution will decrease dramatically as the number of values per ID increases. I figured that this could be solved by using a GROUP BY in the recursive part of the CTE but when I tried that I got the following error message:

Msg 467, Level 16, State 1, Line 4
GROUP BY, HAVING, or aggregate functions are not allowed in the recursive part of a recursive common table expression 'n1'.

So, I thought that Rick's method would work with the caveat that it couldn't be used on large datasets. However, there is another problem with Rick's solution as well (sorry Rick). If a name value appears multiple times for the same ID then in the final comma seperated list it will only appear once. Try it, add the following row (which is a duplicate of an existing row) to the original table:

INSERT t1 values (1,'Joe')

The repeated value only appears once in the resultset from RickR's solution. Hence, this simply doesn't work I'm afraid. Its a shame because like I said, I really liked Rick's method of turning this into a recursive problem. Rick, if you can solve this problem of repeated values not appearing then I'll gladly re-assess it.

 

Thank you to everyone that replied, its great to see people coming up with diverse solutions to this problem.

#2 is definately the neatest however as pointed out it doesn't work for all textual values. I have high hopes that solution #1 containing CTEs will perform better as well

In my next post I will compare the performance of solutions #1 and #2 along with Nick Barclay's slightly modified version of #1.

-Jamie

 

 

Published 26 March 2007 06:43 by jamie.thomson
Filed under: ,

Comments

 

Rick R said:

Here's a modified version of my recursive CTE solution that preserves duplicates and prevents the output from the CTE from exponentially increasing:

WITH n0 AS (

SELECT ROW_NUMBER() OVER (ORDER BY id, name) rowid, id, name

FROM t1

),

n1 AS (

   SELECT MIN(rowid) rowid, id, MIN(name) name

   FROM n0

   GROUP BY id

UNION ALL

   SELECT a.rowid, a.id, a.name + ',' + b.name name

   FROM n0 a

   JOIN n1 b

   ON a.id = b.id

   AND a.rowid = b.rowid + 1

)

SELECT   id, max(name)

FROM n1

GROUP    BY id

Now, this is your game and we're playing by your rules, so if you want the dupes, you can say they need to be there.  But the output looks suspiciously like a set, and true sets don't have duplicates.  Really, you are requiring the solution to account for duplicate tuples in the source data, which isn't theoretically correct either.  In a real situation, I would put the question to someone to explain what they plan to do with duplicates in a set like that, or why they would expect them to be there in the first place; often enough it exposes a logical flaw. But, hey, we're just doing this for sport, so logic be damned, right? ;-)

Anyway, after looking over the other solutions, I noticed that my modified form ended up pretty similar to your solution #1, so I'll leave it to you to decide whether it even still qualifies on its own.

Rick

March 26, 2007 17:21
 

jamie.thomson said:

Hi Rick,

Thanks (yet again) for your input. Fair point about the dupes - although duplicates in sets is a theological discussion I don't want to get into right now :)   Of course, coping with dupes if they occur improperly is dead easy as we know.

I agree, your new version looks alot like Nick's and my own. I'll definately test it though and if there is any significant difference then I'll mention it.

Thanks again for your input. It really is appreciated.

Regards

Jamie

March 26, 2007 19:24
 

Gil said:

To avoid the > issue with solution #2 you can do:

SELECT id, REPLACE(REPLACE(STUFF( ( SELECT ','+ name FROM t1 a WHERE b.id = a.id FOR XML PATH('')),1 ,1, '') ,'&gt;','>'),'&lt;','<') AS name_csv

FROM t1 b

GROUP BY id

March 26, 2007 20:19
 

jamie.thomson said:

Thanks for the input Gil. I'm guessing there might be other characters that we might have to worry about as well. I guess we'll find out in due course.

-Jamie

March 26, 2007 20:34
 

Rick R said:

There are two instances in which characters will be converted:

1. predefined XML entities (<, >, &, ', and ")

I found that only <, >, and & cause the output to be escaped with the entity, so & would become &amp;

2. Illegal XML characters

Characters below ascii 32 (#x00-#x1F) are not legal except for tab (#x9, ascii 9), linefeed (#xA, ascii 10), and carriage return (#xA, ascii 13).  In SQL Server, if you use *any* character in the range #x0-#x1F, including those 3, it will be converted to the form &#x00 - &#x1F in the output.

INSERT t1 values (4,'char0 = ' + CHAR(0))

INSERT t1 values (4,'ampersand = &')

the output will look like

4           char0 = &#x00;,ampersand = &amp;

March 26, 2007 22:26
New Comments to this post are disabled

This Blog

Syndication

News

Powered by Community Server (Personal Edition), by Telligent Systems