blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

SSIS Junkie

Conchango are busy and need talented consultants in and around London. Interested? Email me or send me a message

SSIS: Give your transactional systems a break - use raw files!

One of the great selling points of ETL tools such as SSIS is that is is possible to move data from source to destination without "touching disk" as they say. By this I mean that data can be extracted from source and inserted into a destination without having to stage it anywhere first. Transformations and aggregations can be applied to the data in transit without having to rely on another engine (such as a relational database) to do these calculations for you.

That's really powerful functionality but it does present a few problems:

  • Debugging is difficult. You have to rely on the in-built debugging tools which as yet are nowhere near fully matured.
  • During debugging you have to re-open a connection to a source system in order to get the data to be debugged. This can be a major major problem if you have to access live transactional systems during the working day.
  • Back pressure from components in the pipeline can cause connections to systems of record to be held open longer than they should ordinarily have to be.
  • If it takes a long time to get data from source then this will increase the amount of time that you have a connection open to the destination (This is the opposite problem to the previous bullet point. Sort of.)
  • If you are using SSIS transactions, Distributed Transaction Coordinator (DTC) will try to enlist the source in a transaction and this may not be what you want.

Staging data can mitigate these problems. Staging most often means inserting data into a relational database table but with SSIS that need not be the case. SSIS provides a proprietary binary format file that offers super-fast loading into and unloading from the pipeline. They are called raw files.

In the SSIS work that I am currently doing on a day-to-day basis I have developed a new course of action for extracting data from systems of record. In order to avoid the problems listed above I drop the data from source directly into a raw file and then use that raw file as the source for all of my transformations and aggregations.

 

The big advantage that I find is when problems occur, I don't have to go back to the system of record to get the data I need to investigate. Its available in a local (thus faster) proprietary format and I don't have the DBAs of the source systems moaning at me for constantly hitting their systems.. Even if you don't want to go into production with a package that does this there is nothing wrong with holding a local copy of the data in a raw file for development purposes and then at a later date replace your raw file source adapter with the original OLE DB Source adapter. Its as simple as it looks in this screenshot (i.e. very):

I'm not saying that you should adopt the same approach but it is worth considering. My productivity has increased greatly since I started doing this.

-Jamie

UPDATE, 2008-05-19: I've just started on a project where I'm using SSIS on a 64bit machine to extract data from an Excel file. Some of you may know that in this situation you cannot run the package from BIDS because there is no 64bit Excel OLE DB provider, instead you have to use the 32bit version of dtexec.

So, if you follow the technique explained above you can dump your data from the Excel file into a raw file and use 'Execute Task' within BIDS. Happy days!

 

Published 12 December 2006 17:36 by jamie.thomson

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

roy ashbrook said:

OK, so several times recently I have heard about SSIS being able to load 1TB in an hour. These comments

May 5, 2007 00:16
 

Michael Entin's WebLog said:

One of the mechanisms that SSIS data flow engine utilizes to achieve high performance is “back pressure”.

June 13, 2007 03:45
 

Jim said:

I'm in the camp of 'Extract to a staging database'. One reason I like this is that during troubleshooting, I can easily see what was extracted (via SQL).

The Raw file sounds interesting from a performance standpoint, but how do you interrogate it? I'm imagining you would need either data viewers, but that requires running the package to look at the data, and can't be queried. Is there another way?

If not, I would see that as a significant disadvantage to Raw files, no?

September 13, 2007 16:00
 

jamie.thomson said:

Hi Jim,

Great point. It is (sort of) possible to interrogate raw files by using SImon Sabin's raw file viewer: http://sqlblogcasts.com/blogs/simons/archive/2007/01/11/SSIS-Rawfile-viewer---now-available.aspx

Its certainy not as useful as a staging DB tho.

-Jamie

September 13, 2007 23:16
 

eB said:

I stumbled upon the same approach while working with a clients database and giant extracts and trying to not piss off their DBA's.  I wanted to do more research to be sure the approach was sound and your post is great confirmation that I'm on the right track.

Quick question though... I'm no SSIS expert so hopefully you can give me good insight here.  What if you were to use 2 RAW files.  The first to dump data from the source quickly, and then place all of the transformations between the first RAW file and a second dump.  Then load into the targe environment from the second RAW file.  Is there any benefit here as far as possibly reducing connection time to the target server while records flow through the pipe?  Is this a valid idea to improve performance or is it pointless?

Thanks!

December 28, 2007 07:24
 

jamie.thomson said:

eB,

Sounds like it could be a good idea under the right circumstances. As ever, it depends on your situation. There is certainly nothing wrong with doing this.

-Jamie

December 28, 2007 17:06
 

Dmitry said:

Very interesting, i never realized this is what Raw Files were designed to do, thank you for opening my eyes.

January 18, 2008 18:05
 

chockenhull said:

I'm having trouble using the raw file data passed the rowcount.  One of the fields in the raw source it the PK to other files I want to collect and dump out to txt format.  Problem is I can't figure out how to loop through the raw file, set the needed field into a variable to be used by SQL and other tasks in the loop.  There are so many external and output column def but nowhere to assign the output to a User::Variable.  All the samples I've been able to find deal with timers or file processing, which have not been helpful for me.

February 11, 2008 20:32
 

jamie.thomson said:

chockenhull,

That's kind of a different problem but nevertheless I'll try and address it. If you want to loop over the contents of a raw file then use a recordset destination in your dataflow and loop over the contents of that recordset using conventional methods in the ForEach loop.

-Jamie

February 11, 2008 21:03
 

XL said:

Hi Jamie,

Is there anyone out there who wrote a Raw file lookup component?

Cheers,

XL

March 12, 2008 03:52
 

jamie.thomson said:

XL,

If they have, I don't know about it. It'd be nice wouldn't it?

A workaround might be to drop the data into a flat file (e.g. CSV) and then use something like:

SELECT a.* FROM OPENROWSET( BULK 'c:\test\values.txt', FORMATFILE = 'c:\test\values.fmt') AS a;

to populate the Lookup cache. Not perfect but it'll work.

If it helps at all, what you are after will be eminently possible in SSIS2008. I suspect not.

-Jamie

March 12, 2008 04:06

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

News

Powered by Community Server (Personal Edition), by Telligent Systems