Approximate matching
Moderator: General Moderators
Approximate matching
I need to write some code to match items in a database even if they're not spelled correctly. My options as far as I can tell are MATCH..AGAINST with a full text index (not keen on that idea), SOUNDEX(), metaphone(), or possibly levenstein(). I'm erring on the side of SOUNDEX() because MySQL has it built in so I can easily create an index of SOUNDEX() values, but I think metaphone() might give me a more accurate match.
Has anyone done this before? What did you use?
More info:
Matching a list of names stored in MySQL.
There are approximately 400,000 names.
Names are between 5 and 80 characters long, stored in a VARCHAR column.
They can include non-English characters (although I'm not too bothered if accuracy is lost on those).
Has anyone done this before? What did you use?
More info:
Matching a list of names stored in MySQL.
There are approximately 400,000 names.
Names are between 5 and 80 characters long, stored in a VARCHAR column.
They can include non-English characters (although I'm not too bothered if accuracy is lost on those).
Re: Approximate matching
Levenshtein++
Even better - Damerau-Levenshtein.
http://codejanitor.com/wp/2007/02/10/le ... -function/
Indexing will be hard to perform but, I think you can implement an index-like feature by using ON INSERT trigger and precalculated distances.
EDIT: Other useful articles
http://blog.lolyco.com/sean/2008/08/27/ ... positions/
http://blog.lolyco.com/sean/2008/08/06/ ... -distance/
Even better - Damerau-Levenshtein.
http://codejanitor.com/wp/2007/02/10/le ... -function/
Indexing will be hard to perform but, I think you can implement an index-like feature by using ON INSERT trigger and precalculated distances.
EDIT: Other useful articles
http://blog.lolyco.com/sean/2008/08/27/ ... positions/
http://blog.lolyco.com/sean/2008/08/06/ ... -distance/
Last edited by VladSun on Thu Jul 02, 2009 4:39 am, edited 1 time in total.
There are 10 types of people in this world, those who understand binary and those who don't
Re: Approximate matching
I'm not sure levenstein() is possible to index to be honest. It compares two strings rather than comparing a hash - there's no way to precalculate the distance without knowing what you're comparing against (which I don't, this is a user entered query). With SOUNDEX() and metaphone() I can precalculate the values and then I just need to do a straightforward string comparison for the two hashes.
Using soundex() to find the closest n matches and then using levenstein() might not be a bad approach though.
Using soundex() to find the closest n matches and then using levenstein() might not be a bad approach though.
Re: Approximate matching
I'm not sure SOUNDEX, metaphone or any other *sound* related functions can be any useful when searching for misspelled words.
Also:

Also:
which is against your requirements.When using SOUNDEX(), you should be aware of the following limitations:
This function, as currently implemented, is intended to work well with strings that are in the English language only. Strings in other languages may not produce reliable results.
This function is not guaranteed to provide consistent results with strings that use multi-byte character sets, including utf-8.
Yes, you are right... I'll have to think up something elseonion2k wrote:I'm not sure levenstein() is possible to index to be honest. It compares two strings rather than comparing a hash - there's no way to precalculate the distance without knowing what you're comparing against (which I don't, this is a user entered query).
There are 10 types of people in this world, those who understand binary and those who don't
Re: Approximate matching
Interesting
http://blog.notdot.net/2007/4/Damn-Cool ... 1-BK-Trees
http://hackmap.blogspot.com/2008/05/seq ... ktree.html
So, we onlyneed to implement BK-TREE indexing in MySQL
hahaha
PS: Seems like PGSQL has done some things :
http://osdir.com/ml/db.postgresql.devel ... 00317.html
http://blog.notdot.net/2007/4/Damn-Cool ... 1-BK-Trees
http://hackmap.blogspot.com/2008/05/seq ... ktree.html
So, we onlyneed to implement BK-TREE indexing in MySQL
PS: Seems like PGSQL has done some things :
http://osdir.com/ml/db.postgresql.devel ... 00317.html
There are 10 types of people in this world, those who understand binary and those who don't
Re: Approximate matching
Actually.. n-gram matching might have some possibilities. Something to research, cheers.
Re: Approximate matching
Any progress? I'm still interested in this topic 
There are 10 types of people in this world, those who understand binary and those who don't
Re: Approximate matching
Not at the moment. I'm working on the query side at the moment (see my topic about natural language processing).
Re: Approximate matching
VladSun wrote:Any progress? I'm still interested in this topic
There are 10 types of people in this world, those who understand binary and those who don't
Re: Approximate matching
Hmm.. I can't remember why I asked this now.
Maybe I was drunk
Maybe I was drunk