blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

SSIS Junkie

Conchango are busy and need talented consultants in and around London. Interested? Email me or send me a message

SSIS: Flat File Connection Manager issues

SSIS has been out in the open for nearly 18 months now and in that time its become clear that certain issues are vexing customers time and time again. I wanted to take a little time to talk about the one that I feel results in the most irate criticisms. Its a furore about using the Flat File Connection Manager.

Let me set the scene. The Flat File Connection Manager is used to define the format of an external file in order that we can load its contents into the SSIS pipeline. Usually such files will contain rows that all have the same format but sometimes they don't - and that's where the problems start. Here's an example of a such a file: 

Order1,Jim
Order1,Toaster,2,9.99
Order1,Kettle,1,15.99
Order2,Alison
Order2,Radio,1,18.00
Order2,Toaster,1,9.99
Order3,Henrietta,
Order3,Radio,1,18.00
Order3,Television,79.00

This file contains 2 kinds of records. An order header record (order number, customer name) and an order detail record (order number, product, quantity, price). If you point the Flat File Connection Manager at this file and treat it as a comma seperated value (csv) file then SSIS will not be able to parse it. John Welch has a much more detailed illustration of the problem on his blog and also a method that you could employ to solve it. John's post is here and I recommend you go take a read of his excellent write-up before you carry on reading this.

Given that John has already outlined how you solve this problem I'm not going to explain it here. And besides, I already posted my own workaround (very similar to John's) back in July 2006. What I DO want to do is try to explain why SSIS behaves this way.


The predecessor to SSIS was called Data Transformation Services (DTS) and that tool would successfully load the file above by passing NULLs for columns 3 and 4 in the order header records. It is ex-DTS users that seem to complain the loudest about the behaviour in SSIS. And who can blame them? The fact that DTS would load this file and SSIS will not (without John's workaround) seems like a regression, right?

The simple answer to the question "Why has this been changed in SSIS?" can be summed up in one word. Performance. In order that DTS could cope with files like this it would parse each row twice; once to find the row delimiter and a second time to find all the columns in the row. In SSIS that's no longer necassary; because the file is expected to be in a certain format SSIS doesn't worry that the row delimiter might be in the wrong place hence each row is only parsed once. Obviously then doing it the SSIS way results in a lot less work and thus the file can be loaded a lot quicker.

There is actually another reason as well. The DTS behaviour is unpredictable and thus considered dangerous; in the worst cases it is possible that erroneous data be loaded and you would never know about it. In SSIS the behaviour is predictable; if erroneous data is discovered then the process will fail.


Here are a selection of quotes harvested from around the SSIS community that refer to this behaviour

"the behaviour of the current flat file parser is entirely predictable and was introduced in part because the old DTS behaviour could introduce unexpected results - it was also much slower....I would think it much more risky to behave as if they were not there or to make assumptions as to what was intended when a certain parsing was selected. And indeed, we have plenty of issues in our bug database and customer feedback concerned about the DTS parsing for these reasons." - Donald Farmer, formerly Group Product Manager for SSIS.

"We considered adding what you are asking for; always check if we get the row delimiter in addition to checking for the current column delimiter, and if the row delimiter comes before expected column delimiter pad the remaining columns of that row....We decided not to implement this by default for performance reasons. It is likely we will have it in the next version of the product, but only as an option that will need to be explicitly selected." - Bob Bojanic, Developer on the SSIS product team

"It's a bug.  No way is this a feature!  This is killing us too.  If I have to script this as set out below then something fundimentally is wrong with SSIS" - Garry Swan, SSIS Customer

"They just seem to have re-architected the way some familiar things worked and have not done a very good job of telling people how they have changed." - BarkingDog, SSIS Customer

"simply put, this is very non-intuitive behavior.  People expect a record delimiter to take precedence over a column delimiter.  Yes, it's self-consistent, but that doesn't mean it's "right"." - Beska Miltar, SSIS Customer

 

This post has been an aim to explain the issues around this particular "feature" because I know that a lot of people have experienced a lot of pain with it. What do you think about the current behaviour? Do you agree with the SSIS team's decision to remove the DTS behaviour? Do you think their rationale is sufficient? Let me know OK by posting a comment to this post. I deliberately haven't given my opinion - I want to know what you think.

 

-Jamie

Posted to SSIS (RSS feed) & Connection Manager (RSS feed)

Published 15 May 2007 05:14 by jamie.thomson

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Mark Garner said:

Seems OK to me.  SSIS is very metadata aware and mixing record types in a single flat file seems to me to actually be the problem here.  Those extracts need to be separated.

May 15, 2007 14:57
 

Matt said:

Yeah, I'm ok with this, but as one of the SSIS customers commented, it's just very frustrating when things that used to work in DTS no longer do in SSIS.

May 18, 2007 14:35
 

Steve said:

I too had this problem at first, but I think SSIS helps keep you more honest when using flat files.  SSIS ensures that the file is formatted the same throughout.  If there are ultimately 5 columns being pulled from a flat file, each row of the file should have 5 columns.  Logically, it makes sense in SSIS.  The DTS way, I think, was more "lazy" than intuitive.

May 20, 2007 02:35
 

Dan said:

This behavior fits with my opinion that we should always deal with "explicit" data, rather than "implicit".  

You're going to deal with the difference in the data lines at some point.  It seems better to deal with it up front before it unexpectedly corrupts my data, versus the alternative.

May 22, 2007 15:25
 

Peter said:

To me, the behavior acts like a bug.  If I've defined an end of row character (e.g. CRLF) and that is reached before the expected # of columns, move me on to the next line.  I wouldn't have such a huge problem with this if one of the leading generators of these types of files wasn't MS Excel.  Try saving out the data from Excel into a tab-delimited or CSV file if it's got some of the right-most columns not filled in.  Then try importing that straight in through SSIS - it doesn't work because Excel doesn't save all of the delimiters to the end of the row.

I've run into this more times than I can count and it's really painful, especially when you add in the fact that Excel's data-typing is so poor to start with that I am practically forced to dump it to another format before trying to import it.  (Zip Codes interpreted as Float due to the first ### rows and then including some Zip+4 or Intl codes anyone?)

If they're going to introduce a ragged-right handler for delimited data, that would be really useful to me.  Also useful would be something that just forced all Excel values as (n)varchar types so I don't lose data on import.  (Different topic :-)

As for their explanations - I see their point of view in wanting this to be fast, but to not even give an option to handle data well when exported from their own software is poor design, to me.  As for the workarounds, I'm in a position where I can manually export the data and include a copy of a column with all data included as a dummy row at the end to force all of the delimiters.  Were I in a job handling a lot more files, I'd come up with something else or look at the scripting solution.

May 22, 2007 19:30
 

Dora Manjula said:

The text file that looks like :

"INV_CLS","LD_DT",

"2","2007-06-01"

"2","2007-06-01"

"2","2007-06-01"

works in DTS but f ails to parse in SSIS. Any ideas how to fix this ?

June 5, 2007 20:49
 

M Noreen said:

There's seems to be a "technical" side and a "use case" side to all of this.  Sure, as a developer and technology person, I "get it" about how the connection mgr actually works and the need to do things for performance sake, and what not.

But... here's the other side of the coin.  We're supposed to buy all the hype about how we can "easily" move data across systems using SSIS.  What are the real use cases for that?  Well, reality dictates that there files out there with varying columns.  

We get a lot of variations of address information, for example.  Some people use 2 fields (address1 and address2), others do not.  For there's a "name" field or sometimes "first name" and "last name."

Even MORE common... files that are generated by mainframers.  There is always a header record, then detail records that follow (and yes there could be different kinds of detail records) and finally a trailer record.

That's so common, I've encountered it across several businesses that I've worked with.  Anyone is likely to trip over this that works w/ any kind of bank!  And the type of people using SSIS are also the type of people who will encounter this kind thing eventually!

So, while I understand the technical reasons for the flat file mgr... surely Microsoft (or someone) would recognize the business need.  Maybe we need a new/custom Connection Mgr for dealing with mainframe files or just files with different structures on a row-by-row basis.   I'd want to it to open up a file then do some sort of specialized "conditional split" sending the row down a path depending on a certain evaluation.

I'd accept the "performance hit" if it would just plain work.

In the end, we find it easier just to write a simple little console app in C# to do this rather than SSIS, so I'm not only faced w/ a frustration, but just the fact we driven AWAY from SSIS altogether.  What a shame.

July 2, 2007 19:37
 

jamie.thomson said:

M Noreen,

Thanks for the great comments.

A couple of questions for you.

1) Yes we want to move data easily - I agree completely. Are you prepared to introduce the possibility for error in order for this "ease of use"?

2) The mainframe example I totally get. Back to your "name", "first name", "last name" example though. Do you think that SSIS should legislate for inherently bad data capture? because in my book, that is what that is. I am of the opinion (and this is nothihg to do with SSIS/DTS/Microsoft/whatever) that if systems are providing badly structured data there should be an effort to correct that at source. Just my opinion.

3) Why did you build a console app rather than a SSIS custom component? You have obviously already worked out the logic so changing it into a custom component wouldn't be too much of a stretch.

Thanks for the comments.

-Jamie

July 2, 2007 21:15
 

M Noreen said:

Jamie --

1) and 2) are similar questions to me... That is, yes, I realize that by choosing a "flexible flat file connection manager" that I am also choosing to deal with potential errors at runtime that weren't accounted for at design time.

I guess in this case, I'd rather deal with the occasional exception (aka "error") in exchange for ease of use.

2) I don't consider this "bad data" per se... Just ever so slight variations on a theme.   But maybe that's just me :)

It would be nice/cool/sweet... if I could have my Flexi-Connection-Mgr that would let me dynamically map to named columns if present.  And I would certainly sacrifice some performance for that flexibility.

Once I had my core columns figured out, the rest of the really cool features of SSIS kick in with sending it through the various data source transformations.

Anyway the mainframe data is the truly the biggest issue I come across over and over again.  So dealing w/ that more directly (rather than using a zillion substring derivations) and visually would be nice.

3) that's kinda complicated.  Me, I'm a generalist.  I use Visual Studio, develop .NET pages using everything but the kitchen sink (3rd party custom controls, master pages, CSS, etc.).  In the past I have developed COM objects, and written more #include files and .asp pages than you can shake a stick at.  I even wrote a custom COM-based object to deal with loading up a DTS package and changing it's connections and other items based on user submitted data before executing it.

I'm the resident "developer DBA" too and have learned to be pretty dangergous w/ SQL Server!

But... all this technology is getting tiresome.  I just want to get the work done.  I have some mainframe files I need to automate importing and it was just much more straightforward to do fire up Visual Studio and code up the whole thing.

SSIS is extraordinary, truly.  I was pretty good at DTS, but love that it has gone away (I do NOT consider SSIS an upgrade but something new altogether).  But all that comes at the price of learning yet another technical environment ... and you have to admit, it's not exactly intuitive!  

So when faced w/ having to code something up anyway, I just went to what I was already familiar with in order to get it done quickly.   I'm not importing millions of records... only a few thousand here and there.  I've got SQL 2005 64 bit running on a 64 bit server... it can handle it even my "under peforming" little utility app.

And... there are lots more C# developers around than SSIS programmers.  So when I'm not around (the hit-by-a-bus-theory), it will be easily picked up.  SSIS, while cool to someone like me, is an eclectic choice.

(more than you asked for?)

-- Margo

July 12, 2007 19:51
 

jamie.thomson said:

Margo,

Fantastic rsponse. Thank you so much for taking the time.

There's nothing in there I can really take issue with, all valid points.

You say the environment isn't intuitive. Its hard for me to judge that knowing it so well as I now do. I'm trying to cast my mind back 3 years to when I first saw it and yeah, I guess it did take some getting used to. Having said that I had the advantage of knowing alot about Informatica so I was much more familiar with the paradigm than a DTS-only developer would have been. It some ways it was my Informatica heritage that caused me to look at SSIS and say "This is cool, its just like Informatica. here's something I can really make a play with". The rest is history.

Thanks again for the comments. You should have your own blog for this stuff :)

-Jamie

July 12, 2007 20:09
 

Anand said:

Help me in this scenario.

All my files first row should have column names. In some files the header row is missing . I am passing the flatfile connector dynamically using a variable.

for the files those are missing header rows should be errorerd out.  nnnooo ITS not error out YYYYYYYY..?

This is the template file .

fname,Lname,gender

abc,abc,m

xyz,xyz,F

all files with similar schema should process correctly.(IT SHOULD HAVE header row)

ironically SSIS processed the following file also ( this file does not have header row)

ABC,XYZ,m

xyz,xyz,f

July 28, 2007 01:12
 

wayne said:

HI guys if this is the case how am i supposed to output xls files and populate them from tables dynamically

August 15, 2007 11:02
 

John said:

Access handles this just fine.  Any rows that are correctly formatted are added, while the rows that are incorrect are logged to a separate error table. There is NO EXCUSE that Microsoft changed this behavior without providing an election to A. Process the data the old way, or B to skip and log the bad row numbers.  This should be built right into the Import Wizard.

I'm just FURIOUS over the wasted time this poorly managed "feature" -- which was a bug in DTS 2000 and is now just a different bug in SSIS.  Inexcusable.

December 8, 2007 16:56
 

pav said:

Hi, I am new to SSIS . In my project i have to download files of different csv formats like some file contain double quotes and some contain no double quotes .i have to it in a single package . how can i do it? please help ?

December 26, 2007 12:58
 

jamie.thomson said:

pav,

Sorry, I'm not going to do your work for you, go away and work it out for yourself.

Here's some links that might help:

Online Beginner Resources

http://blogs.conchango.com/jamiethomson/archive/2007/01/30/SSIS_3A00_-Online-Beginner-Resources.aspx

-Jamie

December 28, 2007 16:56
 

MarkC said:

Jamie,

 I'm using the Flat File Destination shape and attempting to set the connection property on the flat file connection manager via a variable. This works great from my developer installation. When I deploy the package and our dba loads it on the test sql server, the connection manager fails. We have been able to repro this issue on the sql server in VS in the debugger. The VS environment behaves differently than in my developer environment. Mine has SP1 installed and the sql server used by the dba does not.

 Are you aware of any dependancies on the version of VS present and the execution of a SSIS package under SQL2005? We have not been able to get around this problem and are looking into applying VS sp1 to the sql server.

Thank you,

Mark

January 25, 2008 22:08
 

Larry Coleman said:

"In SSIS the behaviour is predictable; if erroneous data is discovered then the process will fail."

Having run into this problem in the past, I'm having trouble understanding this statement. In my opinion, the behavior is predictable; the process will silently put things in the wrong columns or wrap two lines together. Failing would be an improvement, because we'd know right away instead of waiting for a customer to complain about missing data.

February 1, 2008 15:45
 

jamie.thomson said:

Larry,

Fair point. But if data does "silently go into the wrong column" then unless you're very very unlucky you will darn well know about it downstream when an operation is attempted that causes an error.

If you're inserting data into a table (and when would you not be in this scenario) and the table is constrained (and if its not then more fool you) then it will blow up on insertion if it hasn't done already upstream.

-Jamie

February 2, 2008 01:03
 

Mark said:

Regarding my post 1/25/2008. The problem I found ended up being the overwrite property. It apparently will not overwrite an existing file regardless if this is set to true. The error given makes you think there are bad characters in the path name and when your using variables to set paths, then I ended up wasting time trouble shooting other possible causes.

I've also talked with MS and thankfully VS isn't coupled in anyway to SSIS. I doubted they were, but the errors and behavior I was seeing had me wondering.

Regards,

Mark

February 13, 2008 19:35
 

Joe Horton said:

I have a delim flat file destination I am dumping too.  However; I don't

seem to have any control in what order the columns are sent out.  My only

workaround was to use Copy Column and have that in the right order then i

can control the order.  The columns are in the right order from the

pre-ceeding transform - I don't get it?

Comming from the pipe the col's are in this order:

Col1, Col2, Col3, ...

But when I do the mapping it does:

Col2, Col1, Col3,...

I don't seem to have a way to correct what order the columns exist in the flat file.

March 3, 2008 22:30
 

jamie.thomson said:

Joe,

You have to change it in the Flat File Connection Manager.

-Jamie

March 3, 2008 22:32
 

Joe Horton said:

I open the flat file connection mgr and see where to mess with what columns and sizes to dump, but it won't let me re-arrange the order.

March 3, 2008 22:42
 

Joe Horton said:

Wait a sec - looks like I can delete them then add them one by one in the order I want...

thanks

March 3, 2008 22:44
 

Steve said:

You say (in other forums) you do not consider this a bug and it cannot be compared to other products.

I do not know if being a Moderator MVP on the Microsoft forums means you are a MS employee, whether you are or not you ARE essentially you are saying don't have your clients use other Microsoft products to provide data to SQL 2005 SSIS because SSIS shouldn't be expected to work like, or play well with data touched by other MS applications.

(They can't use Excel to create reports and CSV export it for you to put on a website)

You also say this is a "feature" that is set this way for "performance".

HELLO, all the work arounds require the the file to be processed twice (and probably far more inefficiently than SSIS would if the ROW delim took prcidence over the column delim), and now EVERYONE looses job performance by having to do a work around for every file they touch because what used to work in previous versions of SQL doesn't anymore, AND what works compatibly with most other MS applications wont work with SSIS.

How about like so many people have asked.  A checkbox that says Process row delims before column delims, and let us suffer the performance hit rather than suffer the writing of a custom workaround for every file we ever have to import!?

By performance hit, I laugh also, because the fastest way ever to snapshot replicate data from DB to DB was CSV export the table, zip it up, FTP it over, unzip and CSV import it.  WAAAY faster than any other type of data transfer.

How much performance are you hoping to gain by not looking for the row delim before the column delim?  Really, just put in an extra little function that allows this, if I were to write this in any language, I could implement the row delim search as small addition to the routine, and 99% of the time for "real world data" where clients provide data to be worked on in the DB, the performance gain would be in ME not having to use a workaround for every file I need to touch, but for SSIS (and thus the build in SQL 2005 flat fikle import wizard) to work as expected where a row delimter means delimit the GD row!

March 10, 2008 22:04
 

Norm said:

I agree with Steve.

Also,

Why can't the row be skipped or redirected in the error output section of the Flat File source?

Typical of all these DTS to SSIS changes is the response from Microsoft. It is not a bug if we just say it isn't.

Now I have SQL 2008 SSIS to look forward to. Weee. More wasting time trying to code performance killing "solutions" to bugs that don't exist to get base functionality that worked years ago to work the same damn way.

Thanks.

March 26, 2008 15:53
 

Debbie S said:

I have a couple of questions as a complete newbie...

Question 1: I'm actually wondering how folks got the Flat File Connection Manager to work. I started doing an evaluation of SSIS yesterday and was going through the tutorial. I was following the instructions and kept wondering why the debug kept showing failures. I discovered that when I specified a flat file name as soon as I save the package the file name and path disappears. Has anyone else had this problem?

Question 2: Our Microsoft account rep told us that we could use SSIS with SQL Server 2000, but when I attempted to use it, it claimed that the service was not installed on the database server I was trying to access. Is there actually a way to use 2005 SSIS against a 2000 SQL Server database and open existing DTS packages? Or must the database be upgraded to 2005 before you can do anything with it?

Thanks!

April 23, 2008 00:46
 

jamie.thomson said:

Dbbie,

1) Sorry debbie, I've never noticed the file path disappearing. i sspect this isa  symptom of something else.

2) SSIS can extract data from and insert data into SQL 2000, that is absolutely true. That's not what you're trying to do though by the sounds of it. Are you trying to connect to the SSIS service on a SQL200 box? You won't e able to do that, the SSIS service comes with SQL2005.

-Jamie

April 23, 2008 11:37
 

Mukunda said:

hi jamie

My flat file some of the rows missing the column now that flat file is generated by the other programme i dont have access to change the file creation. how can i proceed to do this. I thought of reformate the flat file by reading line by line and create the new flat file with all the column delimeter. is this approach is good... or else give some suggestion Thanks in advance..

Mukunda

May 2, 2008 07:06
 

Kmurphy said:

I'm trying to parse the fields in the SSIS package, from <text>~<text>... to  text text, any ideas? there are about 40 columns that I have to use.  

May 20, 2008 19:54
 

Adnan said:

Slightly unrelated, but I have completely given up on this.  Don't know if anyone has faced this problem before, but I really hope so, as I have just spent the last 2 hours on this.

I have a flat file connection manager that uses an expression for the RowDelimiter attribute.  The expression is just a simple read from a variable.  The variable gets populated from the DB fine, no doubt about it.  But it seems that the attribute value is getting ignored completely, as the file is not being parsed correctly.

I have even tried replacing the expression value with a simple constant like 'LF', and placing that value in the middle of my file.  Still doesn't work, so it has nothing to do with the value of the variable, but the expression and/or the attribute value are getting ignored completely.

I am wondering if this is a bug with SSIS.

Thanks,

Adnan

June 3, 2008 14:58
 

Mark K said:

Agree 100% with Debbie S: file path disappearing. Especially if it's a network path. Any suggestions?

June 7, 2008 02:48
 

Gary P said:

I have an odd problem with the Flat file manager for output.  Aside from the fact it wants a file to exist (even though it will be creating it-I figured out how to work around THAT), the problem I have is that if I change the file name in the connection manager and rerun the export, it sends it to the old file name!  I have looked in the Package.dtx file and nowhere does it have the old file name.  The connection manager shows the new file name.  What the?

...

I may have found the problem looking further which I'll  so someone else can benefit.  I am using  a XML configuration file for the project, and I just found the old file name still in the .dtsconfig file!  Apparently changing it in the designer does not make the corresponding change in the .dtsconfig, and it must be using the dtsconfig when it runs in the designer and ignores any changes to properties that were changed in the designer.  This behavior could cause chaos for any properties placed in a config file.

Is there a way to link the config information to the designer without having to go into the configuration and change each setting manually?

July 2, 2008 00:53
 

jamie.thomson said:

Gary P,

Thanks for the comment.

"Is there a way to link the config information to the designer without having to go into the configuration and change each setting manually?"

No there isn't. Perhaps an option to "Update configuration with current designer settings" would be nice. Perhaps you could suggest it at http://connect.microsoft.com/sqlserver/feedback?

-Jamie

July 2, 2008 09:39
 

Gary P said:

Jamie,

I'll do just that.  Thanks

Actually, the way it behaves is that if you open the configuration (which is not the easiest thing to find anyway), it magically changes ALL the exported properties to the current designer values.  I'm not sure if THAT is the best behavior either!

July 4, 2008 00:28

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

News

Powered by Community Server (Personal Edition), by Telligent Systems