blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

SSIS Junkie

Conchango are busy and need talented consultants in and around London. Interested? Email me or send me a message

SSIS: Hiding columns in the pipeline

Yesterday I received an email from a SSIS developer asking questions about some peculiarities of SSIS. One of the points she raised was thus:

I found i spent alot of time trying to figure out how to get rid of un-needed fields in the flow.  Not all transforms  allow you to change the output columns.  This is sort of an organizing and manageability issue.

I get rather concerned when I read this because clearly people are jumping to the conclusion that removing columns from the pipeline will be beneficial to them. Let's be honest, who can blame them? Its intuitive to believe that if columns are removed from the pipeline then the pipeline engine will have less work to do thus it will be quicker but unfortunately that's not the reality. Let me (try and) explain.

Judging by what the emailer above said "Not all transforms  allow you to change the output columns" she discovered that introducing an asynchronous component (such as a Sort, or an Aggregate) into the pipeline enables you to choose the output columns, seemingly reducing the load on the pipeline engine. The reality is that asynchronous components cause the engine to create a new buffer (i.e. allocate a new portion of memory) and this requires contents of memory to be copied from one place to another. Copying memory is an intensive operation and far more detrimental to performance than having columns in the pipeline that aren't used any more. Once columns are in memory all the hard work has been done so they aren't really having any negative affect (even if they are no longer used) so actually the only advantage in removing them is reducing clutter in the designer UI.

Also consider this. If a column is in the pipeline then its there for a reason, and if its not you'll get a warning:

The output column "col_name" (col_id) on output "output_name" (output_id) and component "component_name" (component_id) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.

If you don't get any of those warnings then every column is there for a reason. Just accept that its in memory, it will be used, and when that particular portion of memory (i.e. buffer) has been processed then the memory space will be marked as available. Physically removing the column from memory would take extra processor cycles and thus performance suffers.

 

So, the message is simple, do not actively seek to remove columns from the pipeline using an asynchronous component.

Do you need some statistics to prove it? My friend Darren Green investigated the affects of copying memory versus leaving unused columns in the pipeline and he documented it here and here; hopefully he won't mind my restating some of it below:

Something that people, including me get a bit worried about, is the fact that you end up with these big wide buffers. The classic scenario is when you need to convert between string and Unicode string or visa versa. You end up duplicating all your columns, which seems really horrid. Well it may not be great, but the alternative is even worse.

Data Conversion and Derived Column transforms can do the conversion, but are synchronous, so we have duplicate columns in effect. Would it be better to have a custom component that worked asynchronously, so you didn't end up with two columns for all your strings?

The short answer is no. I wrote a test component, cunningly named DeUnicodeAsynchTestComponent, and simply put it was 2 - 2.5 times slower than a data conversion transform. The cost of copying data between buffers for the asynchronous nature is far more expensive than the extra baggage of the "duplicate" columns.

-Darren Green - 10th May, 2007

Hopefully that has convinced you of the negative impacts of introducing asynchronous components unnecessarily. I have talked before about this peculiarity of the pipeline in my blog entry Dataflow Mechanics in October 2007. I'd urge you to take a read of that blog entry to try and understand a little more about the inner workings of the pipeline engine.

 

One last point I would like to make is something I alluded to above. Hiding columns in the designer (but not actually removing them from the pipeline) can be useful in reducing clutter and to that end I have raised the following request on Microsoft Connect:

SSIS: Hide columns in the pipeline
(https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=252462)

Feel free to click-through and add your weight to the argument to get this feature implemented.

-Jamie

Published 16 January 2008 02:12 by jamie.thomson

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

News

Powered by Community Server (Personal Edition), by Telligent Systems