Welcome to blogs.conchango.com Sign in | Join | Help

Welcome to blogs.conchango.com

SSIS Junkie

SSIS Nugget: Splitting order detail and order header information from one file into multiple tables

"I have a file that is a list of orders. Each record in the file contains the order number and the item that has been ordered along with information about that item (e.g. Quantity ordered). I want to split the file into an OrderHeader and an OrderDetail table with RI between those tables. How do I do that?"

I have seen the above question posed numerous times, in various derivations, on the SSIS forum. It could also be rephrased as "How do I load a dimension and a fact table when the source data comes from the same place?" Its a very common request. I'll try to articulate the question a bit more here. Basically, people want to turn a flat file like this:

which contains a list of orders (you'll notice that some order numbers are repeated) into this normalised model:

The first thing to realise about this requirement is that it cannot be achieved in a single data-flow. This is because the values in OrderHeader.OrderHeaderID will be generated (either by SSIS or by an IDENTITY) and therefore will not be available for OrderDetail until after the data-flow has completed. Hence, we need a data-flow that loads OrderHeader and a second data-flow that loads OrderDetail. This second data-flow will get the required generated OrderHeader.OrderHeaderID and put it into OrderDetail.OrderHeaderID.

OK, so accepting that we need two data-flows it could be assumed that we need to access the source file twice. In actual fact this is not the case because:

  1. For large amounts of data this would be detrimental to package execution
  2. Repeating work when you don't need to is inefficient, prone to error and inconsistency, and quite simply a bad idea.

The altenative here to extracting the same data twice is to use SSIS raw files.

Here's the steps that you would need to take to accomplish this:

  • Extract data from the file inside a SSIS data-flow
  • Split the data into two data-paths using a MULTICAST transform
  • Get a list of all the distinct order numbers and push into the OrderHeader table (in this example we have used an identity for OrderHeader.OrderHeaderID but we could have very easily just generated an ID in the data-flow)
  • Get all of the order detail records and push them into a raw file.

Steps 1-4 all occur in the first data-flow. Here is a screenshot of that data-flow:

and in the second data-flow:

  • Extract data from the OrderDetail raw file
  • Lookup the newly created OrderHeader.OrderHeaderID
  • Insert data to OrderDetail

Here again is a screenshot of that second data-flow

 

And that's it. Obviously this is a very simplistic example because the text file would likely have much more data in it but the principle remains the same.

The key learning point to take away from this post is that raw files can be very powerful tools in your SSIS arsenal. I use them extensively in my daily work and have, in the past, provided a number of examples of using them:

Lots of examples there that demonstrate just how useful raw files can be!

 

Lastly, if you want to run the package that I built to demo what I have talked about here then download the attachment (link below).

-Jamie

UPDATE 2006-10-23: This could also be achieved by defining priorities to destinations within a data-flow and therfore being able to say "Populate destinationA before destinationB". This feature is called 'Intrinsic Flow Priority' and is not currently available in SSIS. I have requested it at Microsoft Connect and you can see the request here:

Intrinsic Flow Priority
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=178058

I am very keen that this feature makes it into a future version of SSIS so please click on this link and your weight to the request in the form of a comment and a reason why this will be useful to you. Simply voting for it won't make much difference.

 

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Professional Association for SQL Server (PASS) SIG said:

May 31, 2006 02:59
 

Neil S said:

Hi

have you tried using a Merge Join after a multicast to add the newly inserted header identity key values into the detail records?  I return the inserted key as a stored proc output paramater, or I think you could use a non cached lookup after the merge if you dont want to use a stroed proc for the header insert.  either way the merge join will ensure the header record has been processed before the detail records are sent down the path in a single data flow. (you do need to sort both paths though which could be expensive)

hope this makes sense :)

Neil

June 26, 2007 18:05
 

jamie.thomson said:

Neil,

It does make sense and it does seem like a valid method. My only worry would be performance. Your stored proc would only insert a record at a time and because its a stored proc tat's alot of processing for just one record. You also have an expensive merge join (plus expensive sorts).

If keeping everything in a single executable is important to you and you don't mind having procedural logic outside of your package then this will work fine. Me? I don't mind more executables - more executables does not have to mean a longer execution time.

Thanks for the comment Neil.

-Jamie

June 26, 2007 18:28
 

Subrat said:

Hi Jamie,

Im really confused. Please help me getting clarified. What does the raw file destination in the first data flow task contains? Does it have the identity column - orderheaderID values? I feel it wont have that. If it doesnt have the identity values, then how in the 2nd data floew task look up is happening.

In my case, i have the ID in 1st table(identity column). The same ID will be used to populate a 2nd table.

Please help me.

Regards.

April 20, 2008 07:38
 

jamie.thomson said:

Subrat,

No, it does not contain the identity column. It contains the business key that can be used by the Lookup component in the second dataflow in order to get the newly generated identity.

-Jamie

April 20, 2008 17:44
 

Subrat said:

Hi Jamie,

Thanks for ur response.

The problem which i'm facing is that i use RAW FILE DESTINATION to store the data in DATAFLOW1.

The RAW FILE DESTINATION in DATAFLOW1 is used in DATAFLOW2 as RAW FILE SOURCE.

But i cant build my package as a single entity, as in DF2 im unable to configure RAW FILE SOURCE without the raw file actually existing.

So is it not possible to build a single package having both raw file source and destination and run the package in one shot.

I have to give the package to client and they wud be running it. They want it 100% ready and they will just click the execute button.

Please suggest me how to overcome this problem.

Regards,

Subrat

April 27, 2008 08:31
 

jamie.thomson said:

April 27, 2008 19:40
 

David said:

I see that your example presupposes that the input file has a key value in it (OrderNumber).  How do you handle a situation where the key value isn't specified?  For example I have a list of name and addresses that I need to populate into a couple of tables (Contacts, Contact Address, etc.)  Once the Contact records are created, how do I reference the IDENTITY values that are associated with the new records so that I can maintain RI between the Contact and Address tables?

August 19, 2008 23:19
 

jamie.thomson said:

David,

There needs to be some sort of unique identifier (which I often refer to as the natural key) for the contact. In the example above the natural key is the Order Number, for your scenario the natural key might be the contact's name, although you will of course run into problems if 2 people have the same name.

-Jamie

August 20, 2008 08:27
 

Mark said:

Would this approach work well if there was a one-to-one association between OrderHeader and OrderDetail?

I am having a problem where I want to break up the input records into two paths. Each path stores some of the data into a different table, with a key shared between them.

The problem is if the insertion succeeds in the first table, but fails in the second table. Ideally, we want a transaction-style operation, where if any insert fails, the inserts (on the same key) should fail.

That is, I don't want to fail the entire *batch* - just the make sure that for every key in A, there exists a matching key in B (and vice versa).

It appears that the raw file approach won't work here... but I'd be grateful for any input.

October 29, 2008 07:17
 

jamie.thomson said:

Mark,

It will work equally well when the relationship is 1-to-1.

The problem you have is a different one though. You're talking about a row-by-row transaction, that means that you would have to process each row in turn (probably using a ForEach loop to loop over your source dataset).

of course, this will be EXTREMELY inefficient.

Another option is to write a stored procedure that does all the work for you. You could perhaps call that stored procedure from the OLE DB Command component, passing in the relevant values as arguments. This will still be a row-by-row operation (i.e. inefficient) but at least you wouldn't be executing a data-flow per row.

Another option might be to have an INSERT trigger on tableA that inserts into tableB but I can't see that working because not all the tavlues for tableB would exist in tableA (otherwise why would you be doing this)

Hope that's given you some food for thought.

-Jamie

October 29, 2008 12:55
 

Mark said:

Thanks for the feedback Jamie.

I suspected as much -- the row-by-row transaction is always going to be inefficient. I'm currently considering at making a custom destination component. Our particular circumstances is actually quite complicated, requiring lots of transformation, exploding a 1-1 into a 1-many relation. doing file copies - all within a per-row transaction.

I still want to leverage the power of SSIS for transformations and configuration, so (although the performance will be miserable), I will have a shot at making my own multiple-input asynchronous destination component, which can manually squash the operations into a DB transaction

October 31, 2008 07:00
 

kitchu said:

hi Can we read mails from a particular mail box and extract the attachments from each mail which are received using SSIS.if yes can you please help.

November 26, 2008 09:40

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Personal Edition), by Telligent Systems