blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

Kristian Wedberg - Business Intelligence, Data Warehousing, SSIS

A Swede on a walkabout

SSIS lookup transform gotcha

We recently used the lookup transform to retrieve foreign keys in a data warehouse project, and were plain lucky to notice the following issue:

When looking up padded varchar strings, the records the lookup transform considers equal differs depending on what CacheType has been selected.

As shown in the example below, using CacheType = None or Partial will match 4 out of 4 records, while using CacheType = Full will match only 2 out of 4 records. Now, it could be argued that both cases are correct, depending on your definition of string equality, but the insidious problem is of course that changing the CacheType changes what records get matched, which is quite unintuitive and could lead to hard to find bugs in your packages.

Thankfully, this only affects padded varchar strings (which are fairly uncommon), and doesn't affect padded char strings (which are much more common.)

Update 2: Microsoft says this behaviour is by design, and comes from SSIS (CacheType = Full) comparing strings differently from SQL Server (CacheType = None or Partial). Sounds like we're stuck with it; read on for details and workarounds.

Demonstration package

We'll demonstrate this by creating two tables, each with four records, corresponding to the four possible combinations:

  • Unpadded lookup string & Unpadded reference string
  • Unpadded lookup string & Padded reference string
  • Padded lookup string & Unpadded reference string
  • Padded lookup string & Padded reference string

The SQL table creation code contains:

CREATE TABLE Lookup (
 ID   INT IDENTITY(1,1),
 String  VARCHAR(10),
 LookupDescription VARCHAR(200)
)
GO
INSERT Lookup (String, LookupDescription) VALUES ('string1', 'This lookup string was NOT padded')
INSERT Lookup (String, LookupDescription) VALUES ('string2', 'This lookup string was NOT padded')
INSERT Lookup (String, LookupDescription) VALUES ('string3   ', 'This lookup string WAS padded')
INSERT Lookup (String, LookupDescription) VALUES ('string4   ', 'This lookup string WAS padded')
CREATE TABLE Reference (
 ID  INT IDENTITY(1,1),
 String VARCHAR(10),
 ReferenceDescription VARCHAR(200)
)
GO
INSERT Reference (String, ReferenceDescription) VALUES ('string1', 'This reference string was NOT padded')
INSERT Reference (String, ReferenceDescription) VALUES ('string2   ', 'This reference string WAS padded')
INSERT Reference (String, ReferenceDescription) VALUES ('string3', 'This reference string was NOT padded')
INSERT Reference (String, ReferenceDescription) VALUES ('string4   ', 'This reference string WAS padded')

The data flow uses an OLE DB source connecting to a SQL Server 2005 to get the records from the Lookup table, and looks up the four strings in the Reference table with a lookup transform.

CacheType = None or Partial

When setting CacheType to None or Partial, all four records match, i.e. the lookup transform disregards any padding the strings have:

Lookup_None-Partial_CacheType.gif

CacheType = Full

When setting CacheType to Full, padding is considered part of the string and only two records match:

Lookup_Full_CacheType.gif

At a guess, the two types of behaviour comes from Full CacheType using a different SQL query when getting all reference records up front, compared to None and Partial CacheType which get the reference records only when needed, one by one.

Workarounds

Simplistic

The (too) simple remedy is of course to just be aware of this behaviour, and use the appropriate CacheType depending on what string equality version is needed. This is however both error prone and restrictive, we really want to be able to choose the appropriate CacheType depending on number of records etc.

Remove padding on the fly

A fairly good alternative is to remove the padding inside the data flow, i.e.:

  • Trim the incoming strings either at the source component with SQL RTRIM(fieldname), or in a Derived Column component
  • Trim the Reference records by adding RTRIM(fieldname) to the Reference query inside the lookup component

This is more flexible than the first alternative, but requires you to remove the padding every time you do lookups on these fields, again somewhat error prone.

Remove padding ahead of time

In our project we instead made sure that all varchar string fields were consistently unpadded when imported into the data warehouse, which works great when you have the luxury of being able to change the lookup and reference records.

End game

Long term, I hope the lookup transform will be updated to remove this 'feature', I'll feedback this to MS.

Update: Reported to Microsoft.

Try out

To see this in action, download the package, configure the connection for your own database, run with debugging with different CacheType settings, and watch either 2 or 4 records get matched.

Conclusion

We've looked at an error prone feature of the lookup transform when matching padded varchar strings. Workarounds include removing padding from the strings before and as part of the lookup, or being very specific about which CacheType must be used for achieving correct behaviour in each instance the lookup transform is used.

Published 22 February 2006 16:43 by Kristian.Wedberg
Filed under:

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

jamie.thomson said:

Great spot. Very strange though. I hope MS address it!

-Jamie
February 23, 2006 00:07
 

Jamie Thomson - Life, the universe and SSIS! said:

A colleague of mine at Conchango, Kristian Wedberg, has started blogging on SSIS. Catch him here: http://blogs.conchango.com/kristianwedberg/default.aspx...
February 23, 2006 00:15
 

Jamie Thomson - Life, the universe and SSIS! said:

A colleague of mine at Conchango, Kristian Wedberg, has started blogging on SSIS. Catch him here: http://blogs.conchango.com/kristianwedberg/default.aspx...
February 23, 2006 00:48
 

Mathew Flannery said:

I think I have discovered a different wrinkle to the lookup gotcha. This involves matching on an empty string i.e. ''.
I create a lookup where the input column and the lookup column both contain a empty string as a possible value. With CacheType = Full, the lookup will not find a match for any incoming rows with an empty string. This not quite the way the padded varchar issue was described. There is no padding involved from the source data since these are empty strings in the SQL Server tables.
This seems to be caused by DT_STR data type SSIS uses internally. The source data is a char(3) with len() = 0, but when it is in the pipeline it turns into DT_STR with len() = 3. It turned the empty string into three blanks?!?
The lookup loading reference data into memory appears to maintain the char(3) as len()=0 for the empty string. The lookup fails to find a match because '' != ' '. I was able to confirm this behavior by using a case statement in the lookup SQL substituting three blank spaces for the empty string. The lookup worked.
This oddity is unfortunate since using CacheType = None slows down performance significantly.
March 8, 2006 20:15
 

Kristian.Wedberg said:

Well spotted, makes me even more convinced one should apply a crow bar and some duct tape to the Lookup transform!-) Meanwhile in the real world, MS is updating BOL to explain the Lookup behaviour, from my feedback post:


Reopened by Kristian Wedberg on 2006-02-26 at 16:20:51

Since the current behaviour is very unintuitive, I suggest that BOL is updated to clearly point out the different behaviours of the Lookup transform in respect of chosen CacheType.

Resolved as By Design by Microsoft on 2006-03-01 at 17:08:03

The topics for the Lookup transformations have been updated to include information about how using different cache types affect string comparisons. The updated information will be available in the April Web release of SQL Server Books Online.
March 9, 2006 10:02
 

Perry said:

I knew about the bug that the cache setting wrecks lookups on NULLs -- much like your example, the results of matches involving NULLs changes depending on the cache setting. I've reported this bug to the SSIS team several times, but they seemed inclined to call it a (bad) "feature" and not want to fix it. (So the user gets to either have a broken lookup, or cache every lookup entirely in memory, which means SSIS cannot be used for large datasets, which is somewhat frustrating.)

I did not know about this further bug.

Thank you for posting it!!
April 4, 2006 17:33
 

Only Talking Sense » SSIS Lookup - Caching Problems said:

June 2, 2006 15:25
 

SSIS Junkie said:

A colleague of mine at Conchango, Kristian Wedberg, has started blogging on SSIS. Catch him here: http://blogs.conchango.com/kristianwedberg/default.aspx

January 16, 2007 18:10
 

joeydj said:

well guess im running into this one too

May 3, 2007 13:55
 

Cem Uney said:

Any solution for this empty string problem?

November 24, 2007 19:20
 

Kristian.Wedberg said:

No "solution" that I'm aware of, just the work arounds above.

That said, with the significant changes to the Lookup component in SQL Server 2008, it's probably worthwhile to check if that brings anything new to this issue...

November 26, 2007 12:11
 

Stay at home mom resource for making extra money online. said:

Join us today as we discuss The Vaccine Book b y Robert W. Sears. This is the second time that contributors of Silicon Valley Moms Blog , Chicago Moms Blog, DC Metro Moms Blog and NYC Moms Blog (along with the rest our friends throughout the blogsphere)

August 3, 2008 09:01
 

Erin said:

Just wanted to thank you for the post! It explained the problem I was seeing perfectly, and told me what to do to get past it. Thanks!

September 22, 2008 17:51

Leave a Comment

(required) 
(optional)
(required) 
Submit
Powered by Community Server (Personal Edition), by Telligent Systems