blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

SSIS Junkie

Conchango are busy and need talented consultants in and around London. Interested? Email me or send me a message

SSIS: Dataflow mechanics

 

This is a post I've been meaning to write for some time but as you may have noticed my blogging output has been pretty lax lately.

During my activity on the SSIS forum I've noticed that much of the content is in regard to the dataflow task and that's not a surprise given that its the most useful tool in the SSIS box and also the most complex. This post is me brainstorming some of the stuff that I know about the dataflow and hopefully it proves useful to some of you.

  • Buffer Architecture. If I'm ever interviewing you for a job as a SSIS developer you can lay a lot of money to say that I'll ask you to tell me what a buffer is. Buffers are fundamental to the dataflow - they are what the dataflow uses to move data around. A buffer is essentially an area of memory and by default consists of approximately 10000 rows (usually slightly less than that) and that's why when you execute a dataflow within BIDS the row counts on the data paths go up in approximate increments of 10000. Performance tuning a SSIS dataflow is essentially all about manipulating various properties until you find the optimum number of rows in each buffer and you can read more (much more) about that here.
  • Dataflows contain components which are generally categorised into synchronous and asynchronous. The most definitive description of these is that the output from a synchronous component uses the same buffer as the input; asynchronous components create a new buffer for their output. All source adapters are asynchronous components, all destination adapters are synchronous. Synchronous components are generally quicker than asynchronous components.
  • Asynchronous components are further categorised as partially-blocking or fully-blocking. Fully-blocking components require all rows from upstream before they put any data into the output; partially-blocking components will start to output data before they receive all upstream rows.
  • Execution trees. Each asynchronous component creates what is called an execution tree in the dataflow. In SSIS 2005 (but not in SSIS 2008) each execution tree uses one execution thread so another part of performance tuning is to fully utilise all processors on your hardware. Read more here.
  • OnPipelineRowsSent. All executables in a SSIS package throw events and one of the events throws by the dataflow is OnPipelineRowsSent. When a component outputs a buffer of data then it throws a OnPipelineRowsSent event and thus enables us to know how many rows each component has processed. When you execute a dataflow within BIDS these events are consumed and are used to change the rowcounts that you see increasing as more rows are processed.
  • Spooling. I said earlier that all buffers are a space in memory but of course memory is finite so if there is more data in the pipeline than can fit in memory then buffers will get spooled to disc. The location on disc is defined by BLOBTempStoragePath & BufferTempStoragePath. Spooling will severely impact dataflow performance so avoid if possible.
  • A lot of people ask if its possible to remove columns from the dataflow once they have finished using them. For example, if columns called [FirstName] & [LastName] are concatenated together to make [FullName] its likely that those two columns won't be needed anymore. The simple answer though is no. Once the data is in memory it would be an overhead to remove the data and "squeeze" the buffer up to make it slower which is why those columns still appear downstream. This is nothing to be concerned about - its highly highly unlikely they are heavily impacting performance. Of course, if an asynchronous component is encountered then a new buffer will be created on the output and the unrequired columns will (probably) be removed. This issue is further discussed here.
  • Following on from the previous point...its intuitive to think that columns that begin at a component don't exist prior to the data being processed by that component. In fact that's not true. Prior to dataflow execution the execution plan for a dataflow is determined and it is at that point that all columns are defined and thus created. So, all columns that will be used in a buffer exist even before the buffer gets any data.
  • The datatypes of columns in the dataflow are different from datatypes used for SSIS variables. To this day I don't understand why the SSIS team opted to use different datatypes in the cotrol flow and data flow and I hope this changes one day.
  • AllThe stock components (i.e. those provided out-of-the-box) are all mostly written in native code. SSIS provides a .Net API that enables you and I to build our own components and hence it is tempting to think that these custom components won't work as quickly as stock components. This is probably true but really the difference is negligible. The majority of the work (validation, memory management, buffer editing etc...) is done by native code so you're not going to suffer severe performance problems by implementing custom components.
  • The BLOB data types (i.e. DT_TEXT, DT_NTEXT, DT_IMAGE) can severely impact dataflow performance so try and avoid them if you can.
  • Raw files can be used to pass data from one dataflow to another - even if those dataflows are in different packages. Raw files have a proprietary file format that is essentially a match of the data in memory and hence reading to and writing from them is extremely quick. people often seem reticent to place data into raw files but I don't hesitate to recommend using them if you need to.
  • Components in the dataflow are joined together by data paths. There is an important property on each data path called IsSorted. A lot of people think that setting this property to TRUE will cause the data in that dataflow to be sorted. That's not true - this property only tells the dataflow that the data is sorted, nothing more. If you set this property to TRUE and the data is not sorted then you may be creating problems for yourself later on.
  • Source and destination adapters maintain external column collections which are used to store the metadata of the external data sinks that those adapters connect to. There are two reasons for this as far as I can determine. Firstly to enable offline development (a big criticism of DTS was that offline development wasn't possible) and secondly to enable the dataflow to validate itself. More information here.
  • Although it appears in BIDS as though the data in a buffer "moves" from one component to another that isn't actually the case. Data in a buffer doesn't actually move about in memory. Someone once used an analogy of cars travelling on a road to describe this. The buffers are analogous to cars on the road and milestones along the road are analogous to the components. Instead of thinking of the cars moving along the road to reach the milestones, think of the cars as being stationery and the road moving along underneath the cars.

I'll probably add to this post over time as new things occur to me. In the meantime if you want a more detailed description of how the dataflow works then Kirk Haselden's book has a whole chapter devoted to it. You can also pose questions in the comments although I'd urge you to post questions to the SSIS forum where more people will be available to answer and where your question may already have been answered. Hopefully my fellow MVPs will follow up with some insights of their own as well (if they're reading).

In a future post I'll provide more specific information about the out-of-the-box components.

-Jamie

Published 09 October 2007 04:39 by jamie.thomson

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Matt Masson said:

"All stock components are all written in native code."

Almost true :) there are a couple of managed components - XML Source and Script Component come to mind.

Very nice post.

October 10, 2007 07:26
 

jamie.thomson said:

Ah yes. Good point Matt. I've edited accordingly.

October 10, 2007 13:15
 

Yassine said:

Hi Jamies,

Thanks for your Brainstorming above makes a lot of things clear, but i still have a question related to data flow task's block.

Lets say i have a problem to solve that consist of 5 major transformations and each transformation step ends with a set of dataset i want to use in the next step.

now i can solve all the the problem in one data flow task block but this is not my aim, i would like to solve it in 5 differnt Block (Data Flow Task)'s block.

my question is how can pass result set of a block to an other block as an input to operate on?

thanks in Advance

Yassine

October 11, 2007 11:23
 

jamie.thomson said:

Yassine,

Please explain what you mean by "block". That is a new term to me.

If you want to pass data from one dataflow to another (which I think is what you are asking) then use a raw file as I stated above.

-Jamie

October 11, 2007 14:03
 

Steve Le Monnier said:

Hi Jamie

I've seen in many post that you recommend using Raw Files to move data between dataflow tasks and packages. This seems sensible to me however I've struggled with a problem and have been unable to see how it would be possible to move data between dataflow tasks.

In example one I have a dataflow task that outputs to a recordset object, which is saved as a variable. Control then passes to the Foreach control which can use variables to read in the columns record-by-record (no code scripting required!).

However then I tried example two using a raw file destination I found I had no way of getting the Foreach container to use the data, unless I read it into a dataflow task and convert it into a recordset object... if I do this I might as well stick to example one?

Hopw can you prepare data in a dataflow task and then have a foreach loop feed off that data?

Many thanks for any assitance you can provide.

Steve Le Monnier

steve _ lemon @ hot mail . com

October 24, 2007 13:41
 

jamie.thomson said:

Steve,

You're absolutely right. if you want to to iterate over a recordset using the ForEach loop then you have to use the recordset destination. This is basically the rationale for the recordset destination existing - I can't think of another reason for using it.

-Jamie

October 24, 2007 14:03
 

Steve Le Monnier said:

Thanks Jamie

I'm slowly working through all my SSIS books as I wean myself off DTS.

Typical, none of these books come close to the insights I have gleamed from your articles.

Loving your work, great job.

Steve

October 24, 2007 14:45
 

Chris said:

Hi Jamie,

I'm just beginning SSIS and do not have any experience in the previous DTS flavor. I have a data flow transformation question on what would be the best and easiest way to do "peeking" at the next record then going back to the prior record to add a field.

An example is a timeline and appending the end date when you only have the begin date. Data coming in would have something like [PK,group,begindate] and I need to transform it to [PK,group,begindate,enddate]. The data coming in is already ordered by group, begindate. What i want to do is "peek" at the next row and get the beingdate field and then go back to the previous row and add the peeked at begindate - 1 day.

What would be the best way to go about this?

[PK,Group,BeginDate]

Row 1: 1, Group1, 11/9/07

Row 2: 2, Group2, 11/14/07

Row 3: 3, Group2, 11/28/07

Transformed to be:

[PK,Group,BeginDate,EndDate]

Row 1: 1, Group1, 11/9/07, 11/13/07

Row 2: 2, Group2, 11/14/07, 11/27/07

Row 3: 3, Group2, 11/28/07, GetDate() + 60

(note: last row would just do a GetDate() + 60 or something like that)

November 9, 2007 15:55
 

Farrukh Pasha said:

hey Chris,

I think you can do that with adding a derived column and specifying the name and the expression {GetDate() + 60} or something like that in the expression textbox. Hope it solved ur problem.

FP

November 20, 2007 07:14
 

SSIS Junkie said:

Yesterday I received an email from a SSIS developer asking questions about some peculiarities of SSIS.

January 16, 2008 02:12
 

payday loan on line said:

Complete cash advance service advance cash loan loan payday

January 25, 2008 10:53
 

mosquit0 said:

We currently use DTS with SQL 2000 and will soon be upgrading to SQL 2005 and SSIS.  I have a question.  On our DTS packages which load SQL tables from flat files, if the file is empty the package fails and any steps after the load are not executed.  To avoid this, for packages where the file may be empty we have an active X script associated with the workflow properties on the data transform task which checks file size and if it is 0 does not execute the data transform task.  

The Upgrade Advisor flagged these packages as having migration problems because of this.

My question is, does the data flow task of SSIS also have a problem with empty source files in this situation, and if so, what is the best way to handle this?

February 14, 2008 15:47
 

mosquit0 said:

After reading further, is it true that the data transform (data pump) task will not migrate at all?  I set up the simplest possible one copying columns directly from a text file to a SQL table and the Upgrade Advisor flagged it as a complex data transformation which needed to be replaced.  Can someone clarify this for me?

February 14, 2008 19:28
 

jamie.thomson said:

mosquit0,

As far as I know, this is true.

-Jamie

February 15, 2008 01:59
 

mosquit0 said:

Jamie,

  Thanks for the reply.  So, all my packages which load flat files to SQL tables will need to be rewritten in SSIS.  For the ones which may sometimes have empty files, will I need to write a script task to check for empty file, or will the SSIS data flow task automatically handle an empty file?  (I don't have access to SSIS yet to play around, just trying to start scoping things out.)  Thanks.

February 15, 2008 17:57
 

jamie.thomson said:

Errr....don't know for 100% certainty but I'm pretty sure this wouldn't cause a problem. It'd be a bug if it does.

-Jamie

February 18, 2008 13:22
 

mosquit0 said:

Thanks for your help Jamie.

February 20, 2008 14:12
 

mosquit0 said:

This is a little off topic but I wasn't sure where to post this.

We installed SQL 2005 Developer edition and it appears that Integration Services is installed and running (if I go to SQL Server Configuration Manager I can see it) but if I go to the BIDS and try to open a new project the only project type listed is Reporting Services.  Also, if I go to SQL Server Management Studio, in the Connect to Server, it is not listed.

Any thoughts on why this would be the case?

The logon is set up as Network Services.  Could that cause the problem?

Thanks.

March 7, 2008 15:37
 

yohan said:

hello jamie

First, thanks for all your work and all those articles

I used to use DTS and I am converting all of my packages to SSIS. (much better by the way, but a little bit more tricky sometimes)

here is my problem, and hopefully, you can answer me

I have one data flow task

source OLEDB (about 40000 rows)

I am using the derived tranformation to change some columns

OLEDB destination to insert to a sql database

on the destination table, I have one column which has this contraint below

" CONSTRAINT [pk_import_item_master] PRIMARY KEY CLUSTERED "

the DefaultBufferMaxRows is 10000

if I execute the package, I am getting an issue of

Cannot insert duplicate key in object + SSIS

I double checked my source table, and I have no duplicated

I noticed that if I changed the DefaultBufferMaxRows to 100000, I have no problem, and the table is correctly filled up

any idea, what can causes this issue ?

I think I need somehow to commit the insert after the buffer is reaching this limit or something like this.

any help will be appreciated

keep the good work !

yohan

March 18, 2008 20:17
 

jamie.thomson said:

Yohan,

That sounds strange.

What I would do is set DataAccessMode='OpenRowset'  on the destination and then divert failed rows to the error output. This will enable you to capture all the failing rows into a file or something and thus enable you to see which is the duplicate. Once you know which value is duplicated you will be in a much better position to investigate why.

-Jamie

March 18, 2008 20:32
 

yohan said:

hi jaime. thx for quick response

I tried that. but as I said, I don't see any duplicated

what I tried, is to use a multicast, to send

1) to the sql table

2) to a flat file

I imported the flatfile, and run a query to check for duplicates (with a buffer of 10000) _> found 58 entries

But if I changed the buffer to 100000, and do the same thing, there is no duplicates

As I said, my guess is that the buffer is overwritten somehow, but to be honest I don't know why ?

I tried to import the same source with the old dts package and no problem

really confused here.

March 18, 2008 21:28
 

yohan said:

I am using fast load

keep identity: uncheck

keep nulls: uncheck

table lock: checked

check constraints: checked

row per batch: blank

maximum insert commit size: 0

don't know if that will help you.

I can send you a copy of my package or screenshot if you want.. ( do you have a private email to send this to)

March 18, 2008 21:31
 

jamie.thomson said:

Sorry Yohan, its a mystery. I recommend you head to the SSIS forum (do a search) and see if anyone ther can help you. I don't have time I'm afraid.

March 18, 2008 21:37
 

yohan said:

hello jamie

After investigation, the issue was with a problem of a wrong mapping in the derived transformation. didn't think about this at all.. I will know next time

thanks anyway for your help and all those good articles

yohan

March 20, 2008 18:47

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

News

Powered by Community Server (Personal Edition), by Telligent Systems