In this blog entry I am going to outline the function of the Column Value Distribution profile request in the new Data Profiling Task in SQL Server 2008. BOL states about this profile request:
| Reports all the distinct values in the selected column and the percentage of rows in the table that each value represents. Can also report values that represent more than a specified percentage of rows in the table. This profile helps you identify problems in your data, such as an incorrect number of distinct values in a column. For example, you profile a column that is supposed to contain states in the United States and discover more than 50 distinct values. |
Just like the Column Statistics profile request, I suspect that when people think of data profiling, discovering what distinct values are in a column is something that most commonly comes to mind.
For the purpose of this blog entry I'll use data from the [Sales].[vIndividualCustomer].[CountryRegionName] column in the AdventureWorks sample database.
There is a custom option of this profile request that enables you to specify whether all distinct values should be included in the results or only those that appear frequently enough (the frequency is configurable). You can see those options on the following screenshot:
OK, so that's how you set it up. Here's what the results look like:
Here we get a good summary of the distribution of values in [Sales].[vIndividualCustomer].[CountryRegionName]. Be aware that while writing this I found a bug concerning the of the Column Value Distribution profile request with tables that contain columns of type XML. It should have been fixed by the time the product gets released later this summer but just in case the bug is documented here.
This concludes my brief summary of the Column Value Distribution profile request in the Data Profiling Task that is being introduced with SQL Server Integration Services 2008. For more posts in this series please refer to SSIS: Data Profiling Task: Part 1 - Introduction
-Jamie
Disclaimer: The information in this blog post is correct at the time of publication. It may have changed before SQL Server 2008 is released.