Welcome to blogs.conchango.com Sign in | Join | Help

Welcome to blogs.conchango.com

Kristian Wedberg - Business Intelligence, Data Warehousing, SSIS

A Swede on a walkabout

  • SSIS RetainSameConnection bug and SP2

    We have an AS400 source system and can only use ODBC connections to access it. We also need to set RetainSameConnection to true since we're initializing the AS400 connection with a number of temporary aliases that will be used by later Data Flows and Execute SQL Tasks.

    Turns out that when using the ODBC driver in SSIS with Execute SQL & Data Flow Tasks, there is a bug that makes it forget the RetainSameConnection setting (it will always revert back to false when saved & reopened.) When the connection is closed and new one created, our carefully created aliases are of course gone (since they were connection specific.)

    Some take aways:

    * Beware of this bug, the effect of not having RetainSameConnection is often subtle and hard to spot

    * RetainSameConnection _does_ work with OLEDB (although OLEDB is not really an option with out AS400)
     
    * The bug is fixed in SP2. There is a CTP of SP2 coming very soon to a web very near you;-)

     

    So let's be careful out there!

         Kristian

  • SSIS and “Cursor Operation Conflict”

    Watch out for this feature, not a bug thingy:

    In a Data Flow Task, my OLE DB Destination started to fail intermittently for no appearent reason. The error message was "Cursor Operation Conflict". Some head scratching later I discovered that the SQL Server 2005 database instance had had it's "no count" connection option changed to ON (the default after installation is OFF.)

    It seems that ADO expects getting back row counts, and gets confused when they're missing. Here's the culprit:

     

    no count server setting

    Here’s some background information.


    In short, make sure that the database instances that SSIS uses has the "no count" connection setting turned off (i.e. NO check mark for the "no count" property in the above dialog.)

     

          Happy hacking!

     

     

  • Database Snapshots and SSIS

    Did you think SQL Server 2005 database snapshots were really cool, but wondered how useful they were in practice? Building a Data Warehouse, here's an ideal use for them:

    In my current project we're extracting data for the data warehouse from several SQL Server 2005 source system databases. Since we're extracting from multiple tables from each database, and since those tables continue to be populated during our extract, there is a definite risk that the extracted data is out of sync due to the extracts from different tables happens at slightly different times.

    Database snapshots to the rescue

    By first creating a snapshot of the source system database, and extracting from that static snapshot (instead of the original, changing database), these synchronization issues are avoided!

    Even better, while the underlying database might be modified during the extract, the overhead for creating and using the snapshot is very small - it's proportional to the number of source system table pages changed during the extract.

    Each source system database to be extracted has the following tasks configured:
     

    Control Flow Image

    1. The first Execute SQL Task creates the snapshot. Since no data is copied, this is a virtually instantaneous operation. At this stage the source system Connection Manager still points to the original database.

    Creating the snapshot

    2. The first Script Task switches the database ConnectionString from the original database to the newly created snapshot database. This avoids having to create and keep a second connection manager in sync with the original database.

    Switch ConnectionString

    3. The Sequence Container contains all the Data Flow tasks etc. needed to move the source system data (now guaranteed to be static during the extract) into the data warehouse. NB: the following precedence constraints are "On Completion", so even if the extract fails, the snapshot will be deleted.

    4. The second Script Task switches the ConnectionString back to the original database, i.e. almost same code as 2 above, but change "Source_System_Snapshot_Extract" into "Source_System", and vice versa.

    5. The final Execute SQL Task simply drops the snapshot, again an instantaneous operation.

    Dropping the snapshot

    Wrapping up

    One thing to watch out for is when additional database files are created, either manually by the DBA or automatically due to the source system having a maximum file size set. The snapshot creation command must specify ALL database files, so either update the snapshot command manually when this happens, or extend the Script Tasks to automatically figure out the number and filenames required.

    In summary, database snapshots and SSIS have proved to be a very useful combination for extracting consistent data into the data warehouse.

  • SSIS lookup transform gotcha

    We recently used the lookup transform to retrieve foreign keys in a data warehouse project, and were plain lucky to notice the following issue:

    When looking up padded varchar strings, the records the lookup transform considers equal differs depending on what CacheType has been selected.

    As shown in the example below, using CacheType = None or Partial will match 4 out of 4 records, while using CacheType = Full will match only 2 out of 4 records. Now, it could be argued that both cases are correct, depending on your definition of string equality, but the insidious problem is of course that changing the CacheType changes what records get matched, which is quite unintuitive and could lead to hard to find bugs in your packages.

    Thankfully, this only affects padded varchar strings (which are fairly uncommon), and doesn't affect padded char strings (which are much more common.)

    Update 2: Microsoft says this behaviour is by design, and comes from SSIS (CacheType = Full) comparing strings differently from SQL Server (CacheType = None or Partial). Sounds like we're stuck with it; read on for details and workarounds.

    Demonstration package

    We'll demonstrate this by creating two tables, each with four records, corresponding to the four possible combinations:

    • Unpadded lookup string & Unpadded reference string
    • Unpadded lookup string & Padded reference string
    • Padded lookup string & Unpadded reference string
    • Padded lookup string & Padded reference string

    The SQL table creation code contains:

    CREATE TABLE Lookup (
     ID   INT IDENTITY(1,1),
     String  VARCHAR(10),
     LookupDescription VARCHAR(200)
    )
    GO
    INSERT Lookup (String, LookupDescription) VALUES ('string1', 'This lookup string was NOT padded')
    INSERT Lookup (String, LookupDescription) VALUES ('string2', 'This lookup string was NOT padded')
    INSERT Lookup (String, LookupDescription) VALUES ('string3   ', 'This lookup string WAS padded')
    INSERT Lookup (String, LookupDescription) VALUES ('string4   ', 'This lookup string WAS padded')
    CREATE TABLE Reference (
     ID  INT IDENTITY(1,1),
     String VARCHAR(10),
     ReferenceDescription VARCHAR(200)
    )
    GO
    INSERT Reference (String, ReferenceDescription) VALUES ('string1', 'This reference string was NOT padded')
    INSERT Reference (String, ReferenceDescription) VALUES ('string2   ', 'This reference string WAS padded')
    INSERT Reference (String, ReferenceDescription) VALUES ('string3', 'This reference string was NOT padded')
    INSERT Reference (String, ReferenceDescription) VALUES ('string4   ', 'This reference string WAS padded')

    The data flow uses an OLE DB source connecting to a SQL Server 2005 to get the records from the Lookup table, and looks up the four strings in the Reference table with a lookup transform.

    CacheType = None or Partial

    When setting CacheType to None or Partial, all four records match, i.e. the lookup transform disregards any padding the strings have:

    Lookup_None-Partial_CacheType.gif

    CacheType = Full

    When setting CacheType to Full, padding is considered part of the string and only two records match:

    Lookup_Full_CacheType.gif

    At a guess, the two types of behaviour comes from Full CacheType using a different SQL query when getting all reference records up front, compared to None and Partial CacheType which get the reference records only when needed, one by one.

    Workarounds

    Simplistic

    The (too) simple remedy is of course to just be aware of this behaviour, and use the appropriate CacheType depending on what string equality version is needed. This is however both error prone and restrictive, we really want to be able to choose the appropriate CacheType depending on number of records etc.

    Remove padding on the fly

    A fairly good alternative is to remove the padding inside the data flow, i.e.:

    • Trim the incoming strings either at the source component with SQL RTRIM(fieldname), or in a Derived Column component
    • Trim the Reference records by adding RTRIM(fieldname) to the Reference query inside the lookup component

    This is more flexible than the first alternative, but requires you to remove the padding every time you do lookups on these fields, again somewhat error prone.

    Remove padding ahead of time

    In our project we instead made sure that all varchar string fields were consistently unpadded when imported into the data warehouse, which works great when you have the luxury of being able to change the lookup and reference records.

    End game

    Long term, I hope the lookup transform will be updated to remove this 'feature', I'll feedback this to MS.

    Update: Reported to Microsoft.

    Try out

    To see this in action, download the package, configure the connection for your own database, run with debugging with different CacheType settings, and watch either 2 or 4 records get matched.

    Conclusion

    We've looked at an error prone feature of the lookup transform when matching padded varchar strings. Workarounds include removing padding from the strings before and as part of the lookup, or being very specific about which CacheType must be used for achieving correct behaviour in each instance the lookup transform is used.

Powered by Community Server (Personal Edition), by Telligent Systems