Page 1 of 1

Approximate matching

Posted: Thu Jul 02, 2009 4:07 am
by onion2k
I need to write some code to match items in a database even if they're not spelled correctly. My options as far as I can tell are MATCH..AGAINST with a full text index (not keen on that idea), SOUNDEX(), metaphone(), or possibly levenstein(). I'm erring on the side of SOUNDEX() because MySQL has it built in so I can easily create an index of SOUNDEX() values, but I think metaphone() might give me a more accurate match.

Has anyone done this before? What did you use?

More info:

Matching a list of names stored in MySQL.
There are approximately 400,000 names.
Names are between 5 and 80 characters long, stored in a VARCHAR column.
They can include non-English characters (although I'm not too bothered if accuracy is lost on those).

Re: Approximate matching

Posted: Thu Jul 02, 2009 4:29 am
by VladSun
Levenshtein++

Even better - Damerau-Levenshtein.

http://codejanitor.com/wp/2007/02/10/le ... -function/

Indexing will be hard to perform but, I think you can implement an index-like feature by using ON INSERT trigger and precalculated distances.

EDIT: Other useful articles
http://blog.lolyco.com/sean/2008/08/27/ ... positions/
http://blog.lolyco.com/sean/2008/08/06/ ... -distance/

Re: Approximate matching

Posted: Thu Jul 02, 2009 4:36 am
by onion2k
I'm not sure levenstein() is possible to index to be honest. It compares two strings rather than comparing a hash - there's no way to precalculate the distance without knowing what you're comparing against (which I don't, this is a user entered query). With SOUNDEX() and metaphone() I can precalculate the values and then I just need to do a straightforward string comparison for the two hashes.

Using soundex() to find the closest n matches and then using levenstein() might not be a bad approach though.

Re: Approximate matching

Posted: Thu Jul 02, 2009 4:47 am
by VladSun
I'm not sure SOUNDEX, metaphone or any other *sound* related functions can be any useful when searching for misspelled words.

Also:
When using SOUNDEX(), you should be aware of the following limitations:

This function, as currently implemented, is intended to work well with strings that are in the English language only. Strings in other languages may not produce reliable results.

This function is not guaranteed to provide consistent results with strings that use multi-byte character sets, including utf-8.
which is against your requirements.
onion2k wrote:I'm not sure levenstein() is possible to index to be honest. It compares two strings rather than comparing a hash - there's no way to precalculate the distance without knowing what you're comparing against (which I don't, this is a user entered query).
Yes, you are right... I'll have to think up something else :)

Re: Approximate matching

Posted: Thu Jul 02, 2009 5:01 am
by VladSun
Interesting
http://blog.notdot.net/2007/4/Damn-Cool ... 1-BK-Trees
http://hackmap.blogspot.com/2008/05/seq ... ktree.html

So, we onlyneed to implement BK-TREE indexing in MySQL :) hahaha

PS: Seems like PGSQL has done some things :
http://osdir.com/ml/db.postgresql.devel ... 00317.html

Re: Approximate matching

Posted: Thu Jul 02, 2009 5:30 am
by onion2k
Actually.. n-gram matching might have some possibilities. Something to research, cheers.

Re: Approximate matching

Posted: Wed Jul 08, 2009 3:31 am
by VladSun
Any progress? I'm still interested in this topic ;)

Re: Approximate matching

Posted: Wed Jul 08, 2009 3:42 am
by onion2k
Not at the moment. I'm working on the query side at the moment (see my topic about natural language processing).

Re: Approximate matching

Posted: Wed Dec 09, 2009 4:33 pm
by VladSun
VladSun wrote:Any progress? I'm still interested in this topic ;)

Re: Approximate matching

Posted: Thu Dec 10, 2009 2:34 am
by onion2k
Hmm.. I can't remember why I asked this now.

Maybe I was drunk :drunk: