SSIS has been out in the open for nearly 18 months now and in that time its become clear that certain issues are vexing customers time and time again. I wanted to take a little time to talk about the one that I feel results in the most irate criticisms. Its a furore about using the Flat File Connection Manager.
Let me set the scene. The Flat File Connection Manager is used to define the format of an external file in order that we can load its contents into the SSIS pipeline. Usually such files will contain rows that all have the same format but sometimes they don't - and that's where the problems start. Here's an example of a such a file:
Order1,Jim
Order1,Toaster,2,9.99
Order1,Kettle,1,15.99
Order2,Alison
Order2,Radio,1,18.00
Order2,Toaster,1,9.99
Order3,Henrietta,
Order3,Radio,1,18.00
Order3,Television,79.00
This file contains 2 kinds of records. An order header record (order number, customer name) and an order detail record (order number, product, quantity, price). If you point the Flat File Connection Manager at this file and treat it as a comma seperated value (csv) file then SSIS will not be able to parse it. John Welch has a much more detailed illustration of the problem on his blog and also a method that you could employ to solve it. John's post is here and I recommend you go take a read of his excellent write-up before you carry on reading this.
Given that John has already outlined how you solve this problem I'm not going to explain it here. And besides, I already posted my own workaround (very similar to John's) back in July 2006. What I DO want to do is try to explain why SSIS behaves this way.
The predecessor to SSIS was called Data Transformation Services (DTS) and that tool would successfully load the file above by passing NULLs for columns 3 and 4 in the order header records. It is ex-DTS users that seem to complain the loudest about the behaviour in SSIS. And who can blame them? The fact that DTS would load this file and SSIS will not (without John's workaround) seems like a regression, right?
The simple answer to the question "Why has this been changed in SSIS?" can be summed up in one word. Performance. In order that DTS could cope with files like this it would parse each row twice; once to find the row delimiter and a second time to find all the columns in the row. In SSIS that's no longer necassary; because the file is expected to be in a certain format SSIS doesn't worry that the row delimiter might be in the wrong place hence each row is only parsed once. Obviously then doing it the SSIS way results in a lot less work and thus the file can be loaded a lot quicker.
There is actually another reason as well. The DTS behaviour is unpredictable and thus considered dangerous; in the worst cases it is possible that erroneous data be loaded and you would never know about it. In SSIS the behaviour is predictable; if erroneous data is discovered then the process will fail.
Here are a selection of quotes harvested from around the SSIS community that refer to this behaviour
"the behaviour of the current flat file parser is entirely predictable and was introduced in part because the old DTS behaviour could introduce unexpected results - it was also much slower....I would think it much more risky to behave as if they were not there or to make assumptions as to what was intended when a certain parsing was selected. And indeed, we have plenty of issues in our bug database and customer feedback concerned about the DTS parsing for these reasons." - Donald Farmer, formerly Group Product Manager for SSIS.
"We considered adding what you are asking for; always check if we get the row delimiter in addition to checking for the current column delimiter, and if the row delimiter comes before expected column delimiter pad the remaining columns of that row....We decided not to implement this by default for performance reasons. It is likely we will have it in the next version of the product, but only as an option that will need to be explicitly selected." - Bob Bojanic, Developer on the SSIS product team
"It's a bug. No way is this a feature! This is killing us too. If I have to script this as set out below then something fundimentally is wrong with SSIS" - Garry Swan, SSIS Customer
"They just seem to have re-architected the way some familiar things worked and have not done a very good job of telling people how they have changed." - BarkingDog, SSIS Customer
"simply put, this is very non-intuitive behavior. People expect a record delimiter to take precedence over a column delimiter. Yes, it's self-consistent, but that doesn't mean it's "right"." - Beska Miltar, SSIS Customer
This post has been an aim to explain the issues around this particular "feature" because I know that a lot of people have experienced a lot of pain with it. What do you think about the current behaviour? Do you agree with the SSIS team's decision to remove the DTS behaviour? Do you think their rationale is sufficient? Let me know OK by posting a comment to this post. I deliberately haven't given my opinion - I want to know what you think.
-Jamie
Posted to SSIS (RSS feed) & Connection Manager (RSS feed)