This is a post I've been meaning to write for some time but as you may have noticed my blogging output has been pretty lax lately.
During my activity on the SSIS forum I've noticed that much of the content is in regard to the dataflow task and that's not a surprise given that its the most useful tool in the SSIS box and also the most complex. This post is me brainstorming some of the stuff that I know about the dataflow and hopefully it proves useful to some of you.
- Buffer Architecture. If I'm ever interviewing you for a job as a SSIS developer you can lay a lot of money to say that I'll ask you to tell me what a buffer is. Buffers are fundamental to the dataflow - they are what the dataflow uses to move data around. A buffer is essentially an area of memory and by default consists of approximately 10000 rows (usually slightly less than that) and that's why when you execute a dataflow within BIDS the row counts on the data paths go up in approximate increments of 10000. Performance tuning a SSIS dataflow is essentially all about manipulating various properties until you find the optimum number of rows in each buffer and you can read more (much more) about that here.
- Dataflows contain components which are generally categorised into synchronous and asynchronous. The most definitive description of these is that the output from a synchronous component uses the same buffer as the input; asynchronous components create a new buffer for their output. All source adapters are asynchronous components, all destination adapters are synchronous. Synchronous components are generally quicker than asynchronous components.
- Asynchronous components are further categorised as partially-blocking or fully-blocking. Fully-blocking components require all rows from upstream before they put any data into the output; partially-blocking components will start to output data before they receive all upstream rows.
- Execution trees. Each asynchronous component creates what is called an execution tree in the dataflow. In SSIS 2005 (but not in SSIS 2008) each execution tree uses one execution thread so another part of performance tuning is to fully utilise all processors on your hardware. Read more here.
- OnPipelineRowsSent. All executables in a SSIS package throw events and one of the events throws by the dataflow is OnPipelineRowsSent. When a component outputs a buffer of data then it throws a OnPipelineRowsSent event and thus enables us to know how many rows each component has processed. When you execute a dataflow within BIDS these events are consumed and are used to change the rowcounts that you see increasing as more rows are processed.
- Spooling. I said earlier that all buffers are a space in memory but of course memory is finite so if there is more data in the pipeline than can fit in memory then buffers will get spooled to disc. The location on disc is defined by BLOBTempStoragePath & BufferTempStoragePath. Spooling will severely impact dataflow performance so avoid if possible.
- A lot of people ask if its possible to remove columns from the dataflow once they have finished using them. For example, if columns called [FirstName] & [LastName] are concatenated together to make [FullName] its likely that those two columns won't be needed anymore. The simple answer though is no. Once the data is in memory it would be an overhead to remove the data and "squeeze" the buffer up to make it slower which is why those columns still appear downstream. This is nothing to be concerned about - its highly highly unlikely they are heavily impacting performance. Of course, if an asynchronous component is encountered then a new buffer will be created on the output and the unrequired columns will (probably) be removed. This issue is further discussed here.
- Following on from the previous point...its intuitive to think that columns that begin at a component don't exist prior to the data being processed by that component. In fact that's not true. Prior to dataflow execution the execution plan for a dataflow is determined and it is at that point that all columns are defined and thus created. So, all columns that will be used in a buffer exist even before the buffer gets any data.
- The datatypes of columns in the dataflow are different from datatypes used for SSIS variables. To this day I don't understand why the SSIS team opted to use different datatypes in the cotrol flow and data flow and I hope this changes one day.
AllThe stock components (i.e. those provided out-of-the-box) are all mostly written in native code. SSIS provides a .Net API that enables you and I to build our own components and hence it is tempting to think that these custom components won't work as quickly as stock components. This is probably true but really the difference is negligible. The majority of the work (validation, memory management, buffer editing etc...) is done by native code so you're not going to suffer severe performance problems by implementing custom components. - The BLOB data types (i.e. DT_TEXT, DT_NTEXT, DT_IMAGE) can severely impact dataflow performance so try and avoid them if you can.
- Raw files can be used to pass data from one dataflow to another - even if those dataflows are in different packages. Raw files have a proprietary file format that is essentially a match of the data in memory and hence reading to and writing from them is extremely quick. people often seem reticent to place data into raw files but I don't hesitate to recommend using them if you need to.
- Components in the dataflow are joined together by data paths. There is an important property on each data path called IsSorted. A lot of people think that setting this property to TRUE will cause the data in that dataflow to be sorted. That's not true - this property only tells the dataflow that the data is sorted, nothing more. If you set this property to TRUE and the data is not sorted then you may be creating problems for yourself later on.
- Source and destination adapters maintain external column collections which are used to store the metadata of the external data sinks that those adapters connect to. There are two reasons for this as far as I can determine. Firstly to enable offline development (a big criticism of DTS was that offline development wasn't possible) and secondly to enable the dataflow to validate itself. More information here.
- Although it appears in BIDS as though the data in a buffer "moves" from one component to another that isn't actually the case. Data in a buffer doesn't actually move about in memory. Someone once used an analogy of cars travelling on a road to describe this. The buffers are analogous to cars on the road and milestones along the road are analogous to the components. Instead of thinking of the cars moving along the road to reach the milestones, think of the cars as being stationery and the road moving along underneath the cars.
I'll probably add to this post over time as new things occur to me. In the meantime if you want a more detailed description of how the dataflow works then Kirk Haselden's book has a whole chapter devoted to it. You can also pose questions in the comments although I'd urge you to post questions to the SSIS forum where more people will be available to answer and where your question may already have been answered. Hopefully my fellow MVPs will follow up with some insights of their own as well (if they're reading).
In a future post I'll provide more specific information about the out-of-the-box components.
-Jamie