Approximate matching

Questions about the MySQL, PostgreSQL, and most other databases, as well as using it with PHP can be asked here.

Moderator: General Moderators

Post Reply
User avatar
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Approximate matching

Post by onion2k »

I need to write some code to match items in a database even if they're not spelled correctly. My options as far as I can tell are MATCH..AGAINST with a full text index (not keen on that idea), SOUNDEX(), metaphone(), or possibly levenstein(). I'm erring on the side of SOUNDEX() because MySQL has it built in so I can easily create an index of SOUNDEX() values, but I think metaphone() might give me a more accurate match.

Has anyone done this before? What did you use?

More info:

Matching a list of names stored in MySQL.
There are approximately 400,000 names.
Names are between 5 and 80 characters long, stored in a VARCHAR column.
They can include non-English characters (although I'm not too bothered if accuracy is lost on those).
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Approximate matching

Post by VladSun »

Levenshtein++

Even better - Damerau-Levenshtein.

http://codejanitor.com/wp/2007/02/10/le ... -function/

Indexing will be hard to perform but, I think you can implement an index-like feature by using ON INSERT trigger and precalculated distances.

EDIT: Other useful articles
http://blog.lolyco.com/sean/2008/08/27/ ... positions/
http://blog.lolyco.com/sean/2008/08/06/ ... -distance/
Last edited by VladSun on Thu Jul 02, 2009 4:39 am, edited 1 time in total.
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Re: Approximate matching

Post by onion2k »

I'm not sure levenstein() is possible to index to be honest. It compares two strings rather than comparing a hash - there's no way to precalculate the distance without knowing what you're comparing against (which I don't, this is a user entered query). With SOUNDEX() and metaphone() I can precalculate the values and then I just need to do a straightforward string comparison for the two hashes.

Using soundex() to find the closest n matches and then using levenstein() might not be a bad approach though.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Approximate matching

Post by VladSun »

I'm not sure SOUNDEX, metaphone or any other *sound* related functions can be any useful when searching for misspelled words.

Also:
When using SOUNDEX(), you should be aware of the following limitations:

This function, as currently implemented, is intended to work well with strings that are in the English language only. Strings in other languages may not produce reliable results.

This function is not guaranteed to provide consistent results with strings that use multi-byte character sets, including utf-8.
which is against your requirements.
onion2k wrote:I'm not sure levenstein() is possible to index to be honest. It compares two strings rather than comparing a hash - there's no way to precalculate the distance without knowing what you're comparing against (which I don't, this is a user entered query).
Yes, you are right... I'll have to think up something else :)
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Approximate matching

Post by VladSun »

Interesting
http://blog.notdot.net/2007/4/Damn-Cool ... 1-BK-Trees
http://hackmap.blogspot.com/2008/05/seq ... ktree.html

So, we onlyneed to implement BK-TREE indexing in MySQL :) hahaha

PS: Seems like PGSQL has done some things :
http://osdir.com/ml/db.postgresql.devel ... 00317.html
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Re: Approximate matching

Post by onion2k »

Actually.. n-gram matching might have some possibilities. Something to research, cheers.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Approximate matching

Post by VladSun »

Any progress? I'm still interested in this topic ;)
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Re: Approximate matching

Post by onion2k »

Not at the moment. I'm working on the query side at the moment (see my topic about natural language processing).
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Approximate matching

Post by VladSun »

VladSun wrote:Any progress? I'm still interested in this topic ;)
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Re: Approximate matching

Post by onion2k »

Hmm.. I can't remember why I asked this now.

Maybe I was drunk :drunk:
Post Reply