spelling algorithms
Posted: Fri May 08, 2009 4:50 am
I would like opinions about building a server side spelling api to use within an application that I am trying to build. Sinfully aspell and pspell are not really within my requirements because I need it to work for all languages not just the ones supported by aspell. I don't need libraries with words since they might be built after the engine is working.
What I know:
- most spelling engines use double metaphone or similar algorithms to recognize the words that sound like other words that might be the replacement for the word in subject but sinfully the double metaphone, soundex, metaphone and so on and so forth are mostly for english words and words that come from spanish and imported into english.
- levenshtein would be best method to recognize words that look like other words or are misspelled by touching another letter on the keyboard or switching two letters (the classic 'teh' and 'alogrithm' are very well known). some of these words are found by metaphone or soundex but under certain circumstances in the european languages (for example romanian, my own language), some of the words that fall in this category are getting marked as misspelled yet no suggestions can be made by similarity with the actual word since they're not found on the metaphone checking of aspell
What I don't know:
- a way to calculate levenshtein distance between the subject word and all the database words... or filter this amount of words somehow but not by using the metaphone or soundex because while testing with these algorithms the results were not satisfying; and doing this fast enough to make this engine usable on a server environment for an api.
If you guys have any other idea or ... maybe we could brainstorm a bit on this subject in order to build this.
Thanks in advance for your help
What I know:
- most spelling engines use double metaphone or similar algorithms to recognize the words that sound like other words that might be the replacement for the word in subject but sinfully the double metaphone, soundex, metaphone and so on and so forth are mostly for english words and words that come from spanish and imported into english.
- levenshtein would be best method to recognize words that look like other words or are misspelled by touching another letter on the keyboard or switching two letters (the classic 'teh' and 'alogrithm' are very well known). some of these words are found by metaphone or soundex but under certain circumstances in the european languages (for example romanian, my own language), some of the words that fall in this category are getting marked as misspelled yet no suggestions can be made by similarity with the actual word since they're not found on the metaphone checking of aspell
What I don't know:
- a way to calculate levenshtein distance between the subject word and all the database words... or filter this amount of words somehow but not by using the metaphone or soundex because while testing with these algorithms the results were not satisfying; and doing this fast enough to make this engine usable on a server environment for an api.
If you guys have any other idea or ... maybe we could brainstorm a bit on this subject in order to build this.
Thanks in advance for your help