blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

SSIS Junkie

Conchango are busy and need talented consultants in and around London. Interested? Email me or send me a message

SSIS: Case-sensitivity in Lookup component

Jorg Klein has just written a useful blog entry about the SSIS Lookup component entitled "SSIS – Lookup Transformation is case sensitive". Jorg correctly observes that:

A lookup transformation uses full caching by default. This means that the first thing it does on execution, is loading all the lookup data in its cache. When this is done it works as expected, but with case sensitivity.

Jorg suggests a workaround of using a different CacheType:

The solution is to set the CacheType property of the lookup transformation to Partial or None, the lookup comparisons will now be done by SQL Server and not by the SSIS lookup component.

This will (usually) work but I want to point out a couple of pitfalls here:

  1. It will only work if the SQL Server database that the Lookup component is connecting to is using a case-insensitive collation
  2. This technique issues a query against the database for every single row in the pipeline and hence it will usually be an order of magnitude slower than using CacheType=Full. [N.B. Please read Paul Williams' comment below where he gives a great explanantion of why I chose to use the word "usually" here.]

 

For these reasons changing the CacheType property isn't a technique that I ever employ. Instead I use Jorg's other suggested workaround:

Another solution is to format the data before you do the lookup. You can do this using the T-SQL LOWER() or UPPER() functions. These functions can be used in a query or for example in a derived column SSIS component.

In that last sentence Jorg suggests changing the case in EITHER the lookup cache OR the pipeline data. I would go further than that, I would change the case in BOTH places so as to completely avoid the chance of failed cache hits.

 

Thanks to Jorg for raising an important issue because I daresay that the majority of SSIS developers have been stung by this at some point. The lack of a case-insensitive Lookup component in SSIS bothers me greatly. If you want to add your weight to the campaign for a case-insensitive Lookup component then go here to vote and voice your opinion.

-Jamie

Published 12 February 2008 16:59 by jamie.thomson

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Matthew Roche said:

Jamie - I think it's interesting that the Connect link you posted is one of many that is closed as a duplicate issue. There are so many different Connect issues related to the limitations of the Lookup transform (and its inherent case sensitivity being a primary one) that it is often difficult to keep track of them. This one (http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=127229) is the "master" issue posting and has an average rating of 4.88 by 56 users, but has still been open for well over two years....

;-(

February 12, 2008 19:00
 

jamie.thomson said:

Matthew,

Very true. However in the case of the submission that you point to, both of the items that have been raised there have been fixed in katmai even though the comment says (laughingly) "the issue has been moved to the next phase as it will not make the Katmai release".

WHAT DO YOU MEAN ITS BEEN MOVED TO vNext? ITS ALREADY BEEN FIXED BY THE PROVISION OF THE NEW LOOKUP.

You gotta laugh or you'd cry.

-Jamie

February 12, 2008 19:23
 

Phil Brammer said:

Actually, Jamie's linked Connect submission claims it is currently active...  Not too long ago, I created a new Connect submission for the purposes of fixing this in Katmai (because, at the time Jamie's submission was closed).  One way or another, I think they need to allow the developer an easy way to perform case-insensitive lookups.

It may be difficult for us to track the Connect submissions on this topic, but rest assured, the dev team is aware of the problems, not just from Connect, but also from my (our?) direct communications with the dev team.  

February 13, 2008 05:15
 

Matt Masson said:

Alright, that didn't look right so I did some digging. It looks like this is what happened:

The issue that Matthew listed is being used as a "master tracking" issue internally. Its description doesn't actually look anything like it does on Connect. When those other connect items for lookup were closed as duplicate, whatever they were asking for was added to the master item. When one of our PMs posted that we were pushing this feature to the next major release, they probably didn't notice the description difference between the bug and the connect issue (the original text isn't immediately apparent in our system). They probably should have been clear about which features aren't in Katmai (ie. case insensative lookup)

I think we definitely need to improve the process here, and it's one of the things our PM team will be working on.

In the mean time, feel free to shoot me an email if the status of a connect item isn't clear. We will be going through and updating old connect items at some point, but right now we're focused on ironing out the remaining issues, and getting Katmai shipped. :)

February 13, 2008 17:09
 

Paul Williams said:

Jamie,

I originally thought that also using the default of full caching would always be quicker (and also according to BOL this seems to suggest this) but my experience indicates this is not the case.

It is correct to say that by setting to partial or no cache 'This technique issues a query against the database for every single row in the pipeline and hence it will usually be an order of magnitude slower than using CacheType=Full' but note the word USUALLY.

The answer as always is it depends ie it depends upon the amount and type of data that you are using and also the % matches you get against your lookup.

Let me explain.

It depends upon the size of the reference data set and the data in the pipeline e.g. if reference data set is 100,000 rows, but only 20 rows come into the pipeline, full cache will have to cache 100,000 rows first.

Using partial cache will be much quicker since no data is cached first and even though the reference query has to run for each row coming through the pipeline it will be much quicker because it is likely that it finds these 20 rows without having to compare against all 100,000 rows.

Another impact will be whether the data coming in is unique or not ie how often a match will occur against the LKP reference data set. If it will occur often the full cache is less likely to be of benefit.

Another impact I found was the no. of Lookup transforms themselves in the pipeline ie. if you had mutliple Lookups in one DFT.

When any Lookups are used in the DFT, & the Lookups are set to use Full Cache mode, ALL the data from ALL the Lookup tasks is first cached before the pipeline is started so e.g.

if have lookup transform at beginning and end of pipeline it still caches all data for both before starting the pipeline (can see this happening in the progress tab).

This may not improve performance because say e.g. the first LKP in the transform it would be useful to have Full cache, but the pipeline diverges so that only a subset of the data initially coming in reaches the final LKP transform then having to cache all the data first may not have been an advantage.

Performance will also depend upon whether the LKP’s in the DFT use exactly the same reference data set or not since it will only be cached once with full cache.

Here are some results on testing that I did (times are for Full, partical and no cache):

No. rows in pipeline = 20,

No. rows in reference dataset = 100,000 58 secs 14 secs 14 secs In this case Partial cache is faster.

No. rows in pipeline = 40,

No. rows in reference dataset = 38,000 24 secs 34 secs -- In this case Full cache is faster.

No. rows in pipeline = 50,

No. rows in reference dataset = 65,000 2 mins 45 secs 47 secs In this case Partial cache is faster.

Therefore Jamie my testing showed that in fact it DEPENDS ! To never change the CacheType property from FULL is something I never do. I always test different scenarios with the data. You may surprisingly find that partial or no cache is indeed faster.

February 14, 2008 09:16
 

jamie.thomson said:

Paul,

You are absolutely right, that is why I used the word "usually" in my blog entry above. However, it was remiss of me not to make the point more explicitly and I am incredibly grateful to the lengths you have gone to here to illustrate the point. I have added a comment to the blog entry accordingly to direct readers to your comment.

-Jamie

February 15, 2008 01:39
 

Paul said:

Glad to have provided feedback and thanks for all your blog entries. They have helped me a great deal in developing with SSIS.

February 15, 2008 08:35

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

News

Powered by Community Server (Personal Edition), by Telligent Systems