blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

SSIS Junkie

Shifting data in a northern accent

Subscribe to the main Conchango RSS feed at http://blogs.conchango.com/MainFeed.aspx

IM me Get alerted when a new blog entry is posted Windows Live Alerts Translate this page

SSIS: Comparing performance of a raw file against a recordset destination

In January of this year I posted instructions and some code to show how one could use the Recordset destination in conjunction with a source script component to negate the need for a raw file to pass data between data-flows in the same package.

I didn't understand why anyone would want to do this given that raw files are much easier to use and mean that you aren't required to write code although it did occur to me that perhaps people didn't like dropping data to disk rather than having it in memory.

That excuse doesn't really cut any ice with me though so I figured that the only possible reason to use a recordset instead of a raw file would be for reasons of performance. I didn't know which would be the quicker solution so I set out to find out.

I built a package that did the following:

  1. Used a script component to load an arbitrary number of rows into a pipeline. Each row was 108 bytes wide. I copied this component into another data-flow.
  2. In one of those data-flows, pushed the data into a raw file
  3. In the other data-flow, push the data into a recordset
  4. Introduce two new data-flows. One which consumes data from the raw file, one from the in-memory recordset. Both pushed the data into a rowcount component
  5. Had two sequence containers. One to group the two data-flows that populate and extract from a recordset and another to group the two data-flows that populate and extract from a raw file.

Here's a screenshot of the package:

I executed each sequence seperately by disabling the other one. I ran the package from the command-line so as to get as accurate results as possible.

Here are those results

Number of rows     Raw file execution time (X)     Recordset execution time (Y)     X / Y
100000 2.534 9.794     25.87%
200000 4.446 19.548     22.74%
300000 6.479 30.694     21.11%
400000 8.582 38.065     22.55%
500000 10.375 47.579     21.81%
600000 12.377 56.912     21.75%
700000 14.24 66.346     21.46%
800000 16.374 76.4     21.43%
900000 18.156 86.344     21.03%
1000000 20.189 98.432     20.51%

and here they are tabulated:

That's fairly conclusive proof, I think you'll agree, that using a raw file is way way quicker than using a recordset. The comparison in the table above shows that using a recordset takes approximately 5 times as long to execute compared to using a raw file.

What is also interesting to note is that there are significant spikes in memory usage when using a recordset destination. When using a raw file there is hardly a ripple.

 

So there you have it. There is no discernible reason (that I can think of) to EVER consider using a recordset in preference to a raw file. So, don't use them. Simple as that.

If you want to try this out for yourself, download the package from here.

-Jamie

 

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Jim K said:

Would you expect to see the same performance if you subsequently did a bulk insert into a SQL Server table?
June 29, 2006 15:44
 

jamie.thomson said:

Jim,
I was only testing loading of and subsequently extracting from a raw file/recordset. Thereafter the data is in the pipeline so how it got there is irrelevant. Therefore, the performance of the BULK INSERT operation itself should not be affected.

In answer to your question "Would you expect to see the same performance if you subsequently did a bulk insert into a SQL Server table?", I would have to turn that around and say "Why would the destination adapter affect how the data gets into the pipeline in the first place?" Answer: it doesn't!

Why not download the demo package and give it a try!

-Jamie
June 29, 2006 16:04
 

Jim K said:

I have! Let me explain what I meant. I am working on a pair of custom data flow components (a source and a destination) that allow the source and the destination tables to change while the package is running (for example, looping through a list of tables on one server and copying the contents to another one). I am seeing a performance bottleneck that seems to be the bulk inserts back into the destination table. Since I am storing the data from the pipeline into a recordset before doing the bulk insert (using SqlBulkCopy), I was wondering if maybe I would see the same speed differential that you observed if I used a raw file and then figured out how to do a bulk insert directly from the raw file. (Looking at a trace of transactions of SQL Server while using a  SQL Server Destination component, it appears to use a raw file and a bulk insert).
June 29, 2006 17:59
 

jamie.thomson said:

Hi Jim,
Interesting stuff. I didn't state in the post above but the main bottleneck when using a recordset was loading the thing, not extracting from it.

In your use of recordsets I would suggest that yes, this is where your bottleneck is occurring. Have you witnessed high memory utilisation when you do this?

-Jamie

June 30, 2006 10:26
 

Antonio said:

Hi Jamie,

one question:

isn't possible to use Raw File as Source for a Lookup, is it?

thank you

July 25, 2007 11:12
 

jamie.thomson said:

Hi Antonio,

Unfortunately, in SSIS2005, it is not.

-Jamie

July 25, 2007 16:09
 

CozyRoc said:

We have built components for transfering data from one data flow to another data flow in-memory. The performance is on-par the raw data file source/destination solution and has the potential to be even faster because it can be parallelized. Please check Data Flow Source and Data Flow Destination components.

You can download our SSIS library from: http://www.cozyroc.com

November 1, 2007 15:47

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

News

Powered by Community Server (Personal Edition), by Telligent Systems