blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

SSIS Junkie

SSIS Nugget: Extracting data from unstructured files

For this nugget I want to talk about SSIS's powerful abilities to deal with partially structured or unstructured data files.

Sébastien Nunes has posted on the SSIS forum asking how he can load a data file that looks like this (column headings in first row):

Col 1;Col 2;Col 3;Col 4;Col 5
AAAA;BBB;CCC;;
AAA1;BBB1;CCC1;;
AAA2;BBB2;CCC2
AAA3;BBB3;CCC3

into the SSIS pipeline. As you can see there five columns defined in the header row (delimited by a semi-colon) but not all rows contain those five columns.

We can argue all day about whether this is well-structured, partially-structured or unstructured but the main point here is that if you point the SSISFlat File Connection Manager at this file then it will not be able to parse out the columns if you define the delimiter as ";". The screenshots here illustrates how SSIS interpretes it:

This is because SSIS looks for all the expected column delimiters before it looks for the end of line character (CRLF). If it doesn't see them then it will not terminate the row.

So, how to deal with this instead? Well, its quite simple really. We just need to import the entire contents of the file as a single column and then parse out the required columns one the data is in the pipeline.

In this case I used a script component to do the parsing - in most cases the Derived Column component would be sufficient. Here is that code:

Imports System
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper


Public Class ScriptMain
  Inherits UserComponent
  Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    Row.Column1 = Tokenise(Row.Column0, ";", 1)
    Row.Column2 = Tokenise(Row.Column0, ";", 2)
    Row.Column3 = Tokenise(Row.Column0, ";", 3)
    Row.Column4 = Tokenise(Row.Column0, ";", 4)
    Row.Column5 = Tokenise(Row.Column0, ";", 5)
  End Sub


  'Private function that parses out the columns of the whole row
  Private Function Tokenise(ByVal input As String, ByVal delimiter As String, ByVal token As Integer) As String

    Dim tokenArray As String()
    tokenArray = input.Split(";".ToCharArray) 'Split the string by the delimiter

    If tokenArray.Length < token Then 'Protect against a request for a token that doesn't exist
      Return ""
    Else
      Return tokenArray(token - 1)
    End If

  End Function


End
Class
 
 
 
Not that difficult at all I think you will agree. If you want to have a go at this yourself then download the demo package from here. There is one pre-requisite to running it. You have to put the source file (which is provided in the zip) into "c:\temp".

I hope this demonstrates just how easy it is to handle unstructured data within the SSIS pipeline. Sure, there's a bit of hand-coding to be done but it really isn't that difficult and I think its legitimate to expect that a bit of hand-cracking is necassary when the format is unstructured as this is.

Any questions, let me know!

-Jamie

 

Published 14 July 2006 10:42 by jamie.thomson
Attachment(s): 20060714ImportUnstructuredFile.zip

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Frans van Bree said:

Jamie,

Looks great! Another one to put in my bag :) Ofcourse, I always wear my performance hat, so perhaps a small adjustment could improve it. Currently, for every column, the array-split function is called. If there are say 20 columns and 5 million rows, that would be rather costly.
If you call the tokenize function once per row and have it return the array, you could apply the values subsequently. Something like:

   tokenArray = Tokenize(input, 2)
   Row.Column1 = tokenArray[0]
   Row.Column2 = tokenArray[1]

I would expect it to be quicker.
July 17, 2006 08:22
 

Mark OMeara said:

Excellent tip, quite helpful. Pity it wasn't built into the system.

I think the Tokenise function should be changed slightly to incorporate the delimiter paramter:

e.g.

from:

tokenArray = input.Split(";".ToCharArray)

to:    

tokenArray = input.Split(delimiter.ToCharArray)

January 29, 2007 03:05
 

Dev said:

Hi

Is there any other way to refer to the columns besides the column names , say Row.Item(0) or something?

June 19, 2007 09:59
 

jamie.thomson said:

Dev,

Yeah you can do it. Drag on a script component and add the following method to it:

   Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

       '

       ' Add your code here

       '

       Dim col As IDTSInputColumn90

       Dim cols As IDTSInputColumnCollection90

       cols = Me.ComponentMetaData.InputCollection(0).InputColumnCollection

       MsgBox(cols.Count)

       For Each col In cols

           MsgBox(col.Name)

       Next

   End Sub

-Jamie

June 19, 2007 17:07
 

Johnny said:

Good script I tried it and it works quite well with 5 to 10 columns.

I have 51 columns and having buffer overflow on Row.Column0  (Row.Column0   or Line is a String )

September 11, 2007 17:33
 

Friday the 13th said:

Regarding  referring to the columns besides the column names ..The example was very helpful (I used IDTSOutputColumn90 instead),  but the col.item(i) is read only..

I am trying to also assign value to the output columns, reffering to them by index. Is there a way ?

Thanks

September 29, 2007 06:04
 

Mahesh said:

Hi,

In SSIS,Is there any way of parsing a flat file with repeating headers?

Like,Can i ignore some x no.of rows and pass  y no. of rows,by putting my logic in a loop?

December 7, 2007 10:58
 

jamie.thomson said:

You could but it would be very slow. If I were you I would parse everything as a single row and use a conditional split to get rid of unwanted rows.

-Jamie

December 7, 2007 13:19
 

Paulo said:

Hi, maybe you can help me.

I'm looking for a way to reference columns of a row using a variable.

What I'm trying to accomplish is a function in a script transformation that can write an output _row_ for each input _column_. To this end I think I can reuse the code above, which iterates through columns names and, for each row, find a way to use column names like this:

   [....]    

   For Each col In cols

          if Row.[col.name] <> "string" then 'Access the value of the current column for the current row

             outputbuffer.addrow()

             outputbuffer.attribute = col.name

             outputbuffer.value = Row.[col.name]

          end if

      Next

In another script i have the same problem but with the outputbuffer columns, I would like to access the outputbuffer columns using variables, i.e. calling an output column by name (outputbuffer.[stringwithcolumnname] = value).

Is it possible?

Thanks,

Paulo

February 25, 2008 10:50
 

jamie.thomson said:

Paulo,

Regarding turning columns into rows, this is what the Unpivot transformation does so you should check that out.

Regarding the use of variables like you describe will (I think) require you to reflect on the row. John Welch explains how here: http://agilebi.com/cs/blogs/jwelch/archive/2007/10/21/address-columns-generically-in-a-script-task.aspx

-Jamie

February 25, 2008 14:25
 

Paulo said:

Thanks a lot Jamie!!

Best regards

February 26, 2008 13:19
 

Michelle said:

I have a slightly different problem with quoted delimited columns where some of the columns include the delimiter character, e.g.

Date,Action,Status,UserID,UserFullName

"20080414","Return VALM to","DEAD","KRISHNA1","Krishnana, Anuradha"

"20080414","Return VALM to","DEAD","GOODWIN","Goodwin, Noel"      

"20080414","Resend to Opera","DEAD","BELLG1","Bell, Gavin"

These imported okay in DTS and also preview okay in SSIS when configured with  {CR}{LF} Row Delimiter, {CR}{LF} as Column Delimiter for UserFullname column and TextQualifer as ".

When I run the package it then fails

April 15, 2008 06:21
 

jamie.thomson said:

Michelle,

Why does it fail? What's the error?

-Jamie

April 17, 2008 08:21
 

Pyay said:

Thanks, Jamie.  That's cool..

Mark OMeara,

'If you call the tokenize function once per row and have it return the array, you could apply the values subsequently. Something like:

  tokenArray = Tokenize(input, 2)

  Row.Column1 = tokenArray[0]

  Row.Column2 = tokenArray[1]

I would expect it to be quicker.'

it will be quicker, but it will lost the protection against a request for a token that doesn't exist.

July 8, 2008 17:43
 

Graeme Tout said:

Hi,

I just tried your solution and while the logic works OK, the process keeps failing due to running out of memory, presumably because the script function is writing everything out to memory.

[FF_SRC Input File [643]] Error: The attempt to add a row to the Data Flow task buffer failed with error code 0xC0047020.

Is there a way to make the script write each row directly out to the DB destination, rather than storing it all in memory?

July 11, 2008 03:56
 

jamie.thomson said:

Graeme,

You can write data straight into SQL Server from the script if you like rather than putting it into the pipeline. This isn't my strong suit - its not really anything to do with SSIS - its a VB.Net thing.

It seems strange that you would run out of memory - SSIS is supposed to handle memory for you. Weird.

-Jamie

July 11, 2008 09:58
 

Jim said:

Hey, gr8 article, i having a similar problem but i want to export/create a unstructured flat file , any idea how i can achieve this..

flat file looks like this

a,b,123,34,d,

12,23,34,1,

w,z,

a

Any idea will be greatly appreciated

August 2, 2008 06:06
 

jamie.thomson said:

Jim,

Simply concatenate the values together so that you're outputting a single column of data, basically the opposite to what this blog post explains.

-Jamie

August 5, 2008 09:53
 

Another SSIS Gotcha! — Life with SQL Server said:

October 16, 2008 03:58
 

TheFran said:

How would you go about DELETING the flat file after it is imported?

any help would be great. Thanks.

October 31, 2008 06:58
 

jamie.thomson said:

TheFran,

Use the FileSystem task.

-Jamie

November 2, 2008 10:37
 

Brett Johnson said:

I'm using a tab delimited flat file and get the error that the Value is too large for the column data area of the buffer.  Works fine for mini files but anything over a hundred records I get this error.

Is there a way to increase the buffer size?  

November 26, 2008 21:38
 

jamie.thomson said:

Brett,

Short answer, yes.

Take a read of this: http://www.microsoft.com/technet/prodtechnol/sql/2005/ssisperf.mspx

-Jamie

November 27, 2008 11:23

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

News

Powered by Community Server (Personal Edition), by Telligent Systems