In this blog entry I am going to outline the function of the Functional Dependency profile request in the new Data Profiling Task in SQL Server 2008. BOL states about this profile request:
|
Reports the extent to which the values in one column (the dependent column) depend on the values in another column or set of columns (the determinant column).
This profile also helps you identify problems in your data, such as values that are not valid. For example, you profile the dependency between a column that contains United States Zip Codes and a column that contains states in the United States. The same Zip Code should always have the same state, but the profile discovers violations of this dependency. |
I had difficulty in finding data in the AdventureWorks sample database that would adequately demonstrate the use of this profile request so I had to create my own using the following:
select ROW_NUMBER() OVER (partition by sp.Name order by a.PostalCode) RowNum
, a.PostalCode
, sp.Name AS StateProvinceName
, sp.CountryRegionCode
into Addresses
from Person.[Address] a
inner join Person.StateProvince sp
on a.StateProvinceID = sp.StateProvinceID
update Addresses
set CountryRegionCode = 'ZZ'
where RowNum = 1
Here's what that gives us:
Take a look at the last red box in that image above which shows information for postal codes in California. Notice how there is now no longer any consistency between [StateProvinceName] and [CountryRegionCode] as you would ordinarily expect.
So next stop is to configure the Data Profiling Task to look at this data:
As you can see I've configured the task to check the table that I created earlier. There are a number of other options that can be set but I'm not going to cover them here. Running the task and then loading the results into Data Profile Viewer produces the following output:
Some things to note:
- Overall functional dependency strength is 99.65%. If I hadn't changed any values then this figure would be 100% and given that I only changed 74 out of 18614 rows the near-optimal functional dependency strength isn't surprising.
- There are 4564 records for California of which 4563 have [CountryRegionCode]='US'. The remaining 1 record that I set [CountryRegionCode]='ZZ' causes the functional dependency strength for California to be 99.978%.
- The task returns a functional dependency strength for each determinant column.
I see examples all the time in my everyday work where data inconsistencies like this rear their head so I can see real value in this type of analysis.
This concludes my brief summary of the Functional Dependency profile request in the Data Profiling Task that is being introduced with SQL Server Integration Services 2008. For more posts in this series please refer to SSIS: Data Profiling Task: Part 1 - Introduction
-Jamie
Disclaimer: The information in this blog post is correct at the time of publication. It may have changed before SQL Server 2008 is released.