blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

SSIS Junkie

Conchango are busy and need talented consultants in and around London. Interested? Email me or send me a message

SSIS: The Normaliser Component

Scene Setting

Its a common scenario when doing data integration that one has to process datasets containing redundant data. In many cases this is when data is received in flat files and all data is put into a single file for ease of transport. Perhaps this is better explained by way of illustration; below is a fragment of a file containing order details for a retail establishment:

SalesOrderNumberOrderDateProductQuantity
SO4916901/02/2003Road-250 Black, 444
SO4916901/02/2003Road-650 Black, 601
SO4916901/02/2003Road-550-W Yellow, 382
SO4917001/02/2003Mountain-200 Black, 461
SO4917001/02/2003Men's Bib-Shorts, L2

 

Each order is uniquely identified by SalesOrderNumber and has OrderDate as an attribute. In the 5 rows above there are only 2 different orders, SalesOrderNumber SO49169 contains orders for three different products and SalesOrderNumber SO49170 contains orders for two products. Notice though that there is redundant data in this dataset; any data modeler worth their salt would much rather normalise the dataset so that it looked something like this:

OrderIDSalesOrderNumberOrderDate
1SO4916901/02/2003
2SO4917001/02/2003

 

OrderIDProductQuantity
1Road-250 Black, 444
1Road-650 Black, 601
1Road-550-W Yellow, 382
2Mountain-200 Black, 461
2Men's Bib-Shorts, L2

 

Notice that I have introduced a meaningless identifier [OrderID] that links the two normalised datasets together.


Introducing the normaliser component

Are you OK with that theory? OK good. So allow me now to introduce a new component that I've built called the Normaliser Component that does exactly the same as what I did to the dataset above, except that it does it within the SSIS dataflow. You can download the component from here:

 

Very simply this transformation component takes a single input and splits it into two outputs. The first output (called the "master" output) is a distinct list of all the previously redundant data (SalesOrderNumber & OrderDate in the example above) and the second (called the "detail" output) is all the non-redundant data (Product & Quantity in the example above). Both outputs will contain a column of type DT_I4 called [LinkID] that links together data in the two outputs.

Here is a screenshot of a dataflow that I have built that uses the Normaliser Component to split the exact input data seen above into the same two outputs as shown above. Note that I have used a SORT component to sort the data by SalesOrderNumber & OrderDate.

This component is very easy to use because the only configuration it requires is for you to hook up a single input and it will automatically derive its two outputs. The only requirement is that the input is sorted on the redundant data columns; if the input is not sorted then the component will fail to validate at design-time.

[N.B. It is possible to "trick" the Normaliser Component into thinking the input is sorted by manually setting the IsSorted property of the input and the SortKeyPosition property of at least one of the columns in the input. If you do this then the Normaliser Component will still succeed but more fool you because you will not get the results that you require]

 

I should point out that there is an obscure bug in SSIS that might make the component appear as if it is not valid when in fact it is. Read about that bug here. Don't worry about it though, the workaround is simply to detach and reattach the input which takes all of about 4 seconds.

 


Demo

I've recorded a demo so that you can see this working for yourself. I deliberately included installation of the component so you can see how to do that (its very easy). I have embedded the video below but if I were you I'd go and download it from here:

because its a much better viewing experience than in a small embedded player. Up to you of course.

Also, if you want to get hold of the package that I built in the demo then you can get hold of that as well:


Source Code

If you want to see how this is built then you'll be glad to know that I've uploaded the source code to Codeplex. You can get everything you need here: http://www.codeplex.com/NormaliserComponent. If you want to know how to build asynchronous components for SSIS then this should be a big help to you.

The Codeplex project includes the source code for the installer as well which I built using WiX.


Wrap-up

That's about it. I hope the Normaliser Component is useful to you. If you use it then please please please give me feedback. Is it good or bad? Can it be improved? Anything.

 

Thanks

Jamie

 

Update

Turns out there's a bug in the version linked to in this blog entry. Get version 1.1 here: http://blogs.conchango.com/jamiethomson/archive/2007/09/25/SSIS_3A00_-Normaliser-Component-bug-fix.aspx

Published 07 September 2007 20:16 by jamie.thomson

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Jeremy Knapp said:

Neat component.

I will be testing it out. What kind of license goes with this?

September 10, 2007 06:21
 

jamie.thomson said:

Keremy,

You are free to use wherever you wish. Please seek permission before redistributing.

Thanks for taking the time to ask.

Regards

Jamie

September 10, 2007 16:27
 

bhlaws said:

This looks fantastic. I can't wait to use it. I've been looking for something for SSIS similar to the old Lookup functionality of DTS, where I can denormalize data. I'm guessing that this will work for columns other than a primary identifier (such as an order status or name)? I'm guessing I would then need to update my lookup table with any new values and then update the data with the real lookup values. Your code should be very useful in case I want to create a new component which does this in a more streamlined manner.

Thanks, Jamie!

September 10, 2007 16:35
 

jamie.thomson said:

bhlaws,

Yeah it  will work with any column (except DT_DBTIME - http://blogs.conchango.com/jamiethomson/archive/2007/09/05/SSIS_3A00_-PipelineBuffer.Set_2A0028002900_-methods.aspx), the only requirement is that the input is sorted on something.

-Jamie

September 10, 2007 16:42
 

Alex Crockett said:

Hey Jamie,

Big fan of you blog mate, ncie 1 for sharing your discoveries and insight,....

Im testing out the Normaliser but I get an error.... at first I though somehow the download had screwed the .msi so I downloaded again and still got the same error:

Error: 0xC0047062 at Data Flow Task, Normaliser [541]: System.NullReferenceException: Object reference not set to an instance of an object.

  at Jamiet.Ssis.Normaliser.ProcessInput(Int32 inputID, PipelineBuffer buffer)

  at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostProcessInput(IDTSManagedComponentWrapper90 wrapper, Int32 inputID, IDTSBuffer90 pDTSBuffer, IntPtr bufferWirePacket)

Any ideas on how I could get around this please?

Regards

Al.

September 12, 2007 10:26
 

jamie.thomson said:

Hey Alex,

Really sorry about this but nevertheless, thanks for letting me know.

I suspect I know what this is. Do you have any NULL values in the sorted columns?

Thanks

Jamie

September 12, 2007 16:29
 

SSIS Junkie said:

17 days ago I released my Normaliser Component but embarrassingly it turns out there was a small bug.

September 25, 2007 05:49
 

jamie.thomson said:

Alex,

If you're still reading then know that your problem has been fixed. Read this: http://blogs.conchango.com/jamiethomson/archive/2007/09/25/SSIS_3A00_-Normaliser-Component-bug-fix.aspx

-Jamie

September 25, 2007 05:51
 

Alex Crockett said:

Hey Jamie

Apologies for the late reply to your message!  However I thought I'd drop by to thank you again as the Normaliser got me out of a hole yesterday!  Thanks again.

Alex.

February 29, 2008 08:58

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

News

Powered by Community Server (Personal Edition), by Telligent Systems