blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

SSIS Junkie

Conchango are busy and need talented consultants in and around London. Interested? Email me or send me a message

SSIS Nugget: Splitting order detail and order header information from one file into multiple tables

"I have a file that is a list of orders. Each record in the file contains the order number and the item that has been ordered along with information about that item (e.g. Quantity ordered). I want to split the file into an OrderHeader and an OrderDetail table with RI between those tables. How do I do that?"

I have seen the above question posed numerous times, in various derivations, on the SSIS forum. It could also be rephrased as "How do I load a dimension and a fact table when the source data comes from the same place?" Its a very common request. I'll try to articulate the question a bit more here. Basically, people want to turn a flat file like this:

which contains a list of orders (you'll notice that some order numbers are repeated) into this normalised model:

The first thing to realise about this requirement is that it cannot be achieved in a single data-flow. This is because the values in OrderHeader.OrderHeaderID will be generated (either by SSIS or by an IDENTITY) and therefore will not be available for OrderDetail until after the data-flow has completed. Hence, we need a data-flow that loads OrderHeader and a second data-flow that loads OrderDetail. This second data-flow will get the required generated OrderHeader.OrderHeaderID and put it into OrderDetail.OrderHeaderID.

OK, so accepting that we need two data-flows it could be assumed that we need to access the source file twice. In actual fact this is not the case because:

  1. For large amounts of data this would be detrimental to package execution
  2. Repeating work when you don't need to is inefficient, prone to error and inconsistency, and quite simply a bad idea.

The altenative here to extracting the same data twice is to use SSIS raw files.

Here's the steps that you would need to take to accomplish this:

  • Extract data from the file inside a SSIS data-flow
  • Split the data into two data-paths using a MULTICAST transform
  • Get a list of all the distinct order numbers and push into the OrderHeader table (in this example we have used an identity for OrderHeader.OrderHeaderID but we could have very easily just generated an ID in the data-flow)
  • Get all of the order detail records and push them into a raw file.

Steps 1-4 all occur in the first data-flow. Here is a screenshot of that data-flow:

and in the second data-flow:

  • Extract data from the OrderDetail raw file
  • Lookup the newly created OrderHeader.OrderHeaderID
  • Insert data to OrderDetail

Here again is a screenshot of that second data-flow

 

And that's it. Obviously this is a very simplistic example because the text file would likely have much more data in it but the principle remains the same.

The key learning point to take away from this post is that raw files can be very powerful tools in your SSIS arsenal. I use them extensively in my daily work and have, in the past, provided a number of examples of using them:

Lots of examples there that demonstrate just how useful raw files can be!

 

Lastly, if you want to run the package that I built to demo what I have talked about here then download the attachment (link below).

-Jamie

UPDATE 2006-10-23: This could also be achieved by defining priorities to destinations within a data-flow and therfore being able to say "Populate destinationA before destinationB". This feature is called 'Intrinsic Flow Priority' and is not currently available in SSIS. I have requested it at Microsoft Connect and you can see the request here:

Intrinsic Flow Priority
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=178058

I am very keen that this feature makes it into a future version of SSIS so please click on this link and your weight to the request in the form of a comment and a reason why this will be useful to you. Simply voting for it won't make much difference.

 

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Professional Association for SQL Server (PASS) SIG said:

May 31, 2006 02:59
 

Neil S said:

Hi

have you tried using a Merge Join after a multicast to add the newly inserted header identity key values into the detail records?  I return the inserted key as a stored proc output paramater, or I think you could use a non cached lookup after the merge if you dont want to use a stroed proc for the header insert.  either way the merge join will ensure the header record has been processed before the detail records are sent down the path in a single data flow. (you do need to sort both paths though which could be expensive)

hope this makes sense :)

Neil

June 26, 2007 18:05
 

jamie.thomson said:

Neil,

It does make sense and it does seem like a valid method. My only worry would be performance. Your stored proc would only insert a record at a time and because its a stored proc tat's alot of processing for just one record. You also have an expensive merge join (plus expensive sorts).

If keeping everything in a single executable is important to you and you don't mind having procedural logic outside of your package then this will work fine. Me? I don't mind more executables - more executables does not have to mean a longer execution time.

Thanks for the comment Neil.

-Jamie

June 26, 2007 18:28
 

Subrat said:

Hi Jamie,

Im really confused. Please help me getting clarified. What does the raw file destination in the first data flow task contains? Does it have the identity column - orderheaderID values? I feel it wont have that. If it doesnt have the identity values, then how in the 2nd data floew task look up is happening.

In my case, i have the ID in 1st table(identity column). The same ID will be used to populate a 2nd table.

Please help me.

Regards.

April 20, 2008 07:38
 

jamie.thomson said:

Subrat,

No, it does not contain the identity column. It contains the business key that can be used by the Lookup component in the second dataflow in order to get the newly generated identity.

-Jamie

April 20, 2008 17:44
 

Subrat said:

Hi Jamie,

Thanks for ur response.

The problem which i'm facing is that i use RAW FILE DESTINATION to store the data in DATAFLOW1.

The RAW FILE DESTINATION in DATAFLOW1 is used in DATAFLOW2 as RAW FILE SOURCE.

But i cant build my package as a single entity, as in DF2 im unable to configure RAW FILE SOURCE without the raw file actually existing.

So is it not possible to build a single package having both raw file source and destination and run the package in one shot.

I have to give the package to client and they wud be running it. They want it 100% ready and they will just click the execute button.

Please suggest me how to overcome this problem.

Regards,

Subrat

April 27, 2008 08:31
 

jamie.thomson said:

April 27, 2008 19:40

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

News

Powered by Community Server (Personal Edition), by Telligent Systems