In this blog entry I am going to outline the function of the Column Pattern profile request in the new Data Profiling Task in SQL Server 2008. BOL states about this profile request:
|
Reports a set of regular expressions that cover the specified percentage of values in a string column.
This profile helps you identify problems in your data, such as string that are not valid. This profile can also suggest regular expressions that can be used in the future to validate new values. For example, a pattern profile of a United States Zip Code column might produce the regular expressions: \d{5}-\d{4}, \d{5}, and \d{9}. If you see other regular expressions, your data likely contains values that are not valid or in an incorrect format. |
This sounds like a very useful profile request but its going to take some real analysis after the event to actually make use of it given that it returns regular expressions. I have run it against the [Production].[Product].[ProductNumber] column in the AdventureWorks sample database. That column contains values like:
- HS-2451
- FR-M63B-38
- SH-M897-L
- BK-R19B-58
- MB-2024
- BE-2349
You can see that the values are pretty similar in appearance but that there are also subtle differences between them so it should be interesting to see what this profile request provides.
Here's how I configured the task:
Notice how there are a lot of options and I'm not going to cover them in this blog entry although some of them should be self-explanatory to those that understand regular expressions.
Now the results:
You can see that 4 regular expressions were returned and the distribution of all the rows per regular expression. I'm not going to analyse this any further, you can do that at your leisure.
This concludes my brief summary of the Column Pattern profile request in the Data Profiling Task that is being introduced with SQL Server Integration Services 2008. For more posts in this series please refer to SSIS: Data Profiling Task: Part 1 - Introduction
-Jamie
Disclaimer: The information in this blog post is correct at the time of publication. It may have changed before SQL Server 2008 is released.