blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

SSIS Junkie

Fuzzyness

SQL Server 2008 is introducing some interesting querying and storage enhancements such as the introduction of spatial data and the FILESTREAM attribute on VARBINARY(MAX) columns and my thoughts have recently turned to what querying enhancements we are going to see in future versions. I was chatting to Rob Farley recently and we started to wonder aloud about whether the database engine might one day support fuzzy querying methods in the engine (Rob spoke about it here).

What do we mean by fuzzy querying? Well I'll try to put this as close to lay terms as possible. Fuzzy querying allows us to compare two things that aren't the same but are a little bit the same. Are you any the wiser? Perhaps an example might suffice.

Take the following list of values:

Brian Smith
Bryan Smith
Brian Smyth
Bryan Smyth
Brian Smythe

Now as far as SQL Server 2008 is concerned those five names are all different, that's just how SQL works. Something either is true or it isn't. Yay or nay. Zero or one. Clearly though we as humans can see there is some sort of similarity between those five names and that is important information to know; its not hard to imagine a class of applications that could make use of this ability to know that things are similar as opposed to being the same. For example, perhaps I am maintaining a customer database for which different telesales operators have the ability to enter data. Its not inconceivable that different people may enter data for the same customer except with slightly different spellings and I am going to want to know about that. Fuzzy querying would allow us to check for these anomalies.

Those of you that have used SSIS 2005 will know that Fuzzy Querying exists in that product as provided by the Fuzzy Lookup and Fuzzy Grouping components however I personally don't want to have to call out to a SSIS package to process data that is already in my database. Rob posted a submission to Connect requesting fuzzy querying behaviour in the database engine so if you think this is an important feature go here and vote for it.

Following on from the basic ability to query on similarity, I wonder if indexing on similarity could be possible? Take the example above. Placing an index on those five values today wouldn't result in them appearing contiguously in the index however if we indexed on similarity...perhaps they would. Something to think about.

 

OK, so that covers off fuzzy querying of textual values. Looking even further into the future I wonder if there are even more complex scenarios that could leverage this inate ability to know that two things are similar. Here's a scenario to consider:

I use a free tool called Windows Live Photo Gallery (WLPG) to manage my personal photo library and one of the features of this tool is the ability to tag photos (WLPG uses SQL Server CE under the covers by the way to manage all of the photo data). For example, I might tag all of the photos in which my fiance, Helen, appears.

Its occurred to me before that a really great feature of WLPG would be auto-tagging. That is, the ability to compare a batch of new photos with all my existing photos and then discover that Helen appears in some of my new photos and tag them appropriately.

Here again we are comparing things (images rather than text this time) based on similarity. It may seem a bit far-fetched based on our pre-conceptions of what a RDBMS does but I see no reason why the ability to compare images to discover whether a person appears in both of them shouldn't be a function of the database rather than a standalone tool. Its clear in SQL Server 2008 that Microsoft are taking the product beyond what we traditionally think a RDBMS should do so the capabilities I'm talking about here seem, to me, to be a logical extension.

What do you think?

-Jamie

P.S. My colleague Peter Hanlon has an interesting perspective on fuzzy querying where he considers the impact that it could have on house buying. Read it here.

Published 21 May 2008 07:57 by jamie.thomson

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Joe Harris said:

Jamie,

Isn't this already provided directly in the engine by the Soundex and Edit Distance functions? I believe that's what the Fuzzy components are using under the covers.

I've personally done some pretty deep name & address cleaning using these functions.

BTW I read somewhere that filestream is the remnants of the aborted WinFS file system for Vista...

Joe

May 21, 2008 14:18
 

jamie.thomson said:

Hey Joe,

You might have read it on my blog post from last Friday :)

http://blogs.conchango.com/jamiethomson/archive/2008/05/16/winfs-lives.aspx

I confess I don't know too much about those functions that you mention and I certianly haven't heard about the fuzzy components in SSIS using them. News to me if they are. Still, even if they are it proves that the fuzzy components are a useful abstraction layer and therefore the argument for having them in the database is still true (IMO).

-Jamie

May 21, 2008 14:27
 

Derek B. Bell said:

If you use SOUNDEX for all of the forms of "Brian" you get the same value - B650.  Similiarly, "Smith" evaluates to S530 in every case.  The potential presence of middle initials would through this off.

Additionally, SOUNDEX doesn't work well is when you have to evaluate cases where abbreviations could be used, so MAIN STREET and MAIN ST don't evaluate to the same set of values.

May 30, 2008 21:12
 

Frans van Bree said:

Alse the CONTAINS predicate is in the direction of what you are looking for.

A smal snipped from BOL:

E. Use CONTAINS with <proximity_term>

This example returns all product names that have the word "Boysenberry" near the word "spread".

USE Northwind

GO

SELECT ProductName

FROM Products

WHERE CONTAINS(ProductName, 'spread NEAR Boysenberry')

GO

F. Use CONTAINS with <generation_term>

This example searches for all products with words of the form dry: dried, drying, and so on.

USE Northwind

GO

SELECT ProductName

FROM Products

WHERE CONTAINS(ProductName, ' FORMSOF (INFLECTIONAL, dry) ')

GO

June 10, 2008 22:10
 

SSIS Junkie said:

June 25, 2008 12:06
 

Govind said:

Fuzzy querying will help working with data accumulated over the years from simpler forms with different spellings of same entity and at least flag them considering set of columns (lastname,firstname,age,gender,address etc). This is super important to have tool while trying to prevent "duplication" while still allowing people to have freedom to have the ability to spell differently. PRetty much every database which needs to prevent duplication (unique index etc will not work).

July 10, 2008 18:42

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

News

Powered by Community Server (Personal Edition), by Telligent Systems