Page 1 of 2
Class - Damerau-Levenshtein distance
Posted: Mon Oct 06, 2008 3:19 pm
by VladSun
Code: Select all
class DamerauLevenshtein
{
// Measures the Damerau-Levenshtein distance of two words
public static function distance($str1, $str2)
{
$d = Array();
$lenStr1 = strlen($str1);
$lenStr2 = strlen($str2);
if ($lenStr1 == 0)
return $lenStr2;
if ($lenStr2 == 0)
return $lenStr1;
for ($i=0; $i <= $lenStr1; $i++)
{
$d[$i] = Array();
$d[$i][0] = $i;
}
for ($j=0; $j <= $lenStr2; $j++)
$d[0][$j] = $j;
for ($i=1; $i <= $lenStr1; $i++)
{
for ($j=1; $j <= $lenStr2; $j++)
{
$cost = substr($str1, $i - 1, 1) == substr($str2, $j - 1, 1) ? 0 : 1;
$d[$i][$j] = min(
$d[$i - 1][$j] + 1, // deletion
$d[$i][$j - 1] + 1, // insertion
$d[$i - 1][$j - 1] + $cost // substitution
);
if (
$i > 1 &&
$j > 1 &&
substr($str1, $i - 1, 1) == substr($str2, $j - 2, 1) &&
substr($str1, $i - 2, 1) == substr($str2, $j - 1, 1)
)
$d[$i][$j] = min(
$d[$i][$j],
$d[$i - 2][$j - 2] + $cost // transposition
);
}
}
return $d[$lenStr1][$lenStr2];
}
// Case insensitive version of distance()
public static function distancei($str1, $str2)
{
return self::distance(strtolower($str1), strtolower($str2));
}
// An attempt to measure word similarity in percent
public static function similarity($str1, $str2)
{
$lenStr1 = strlen($str1);
$lenStr2 = strlen($str2);
if ($lenStr1 == 0 && $lenStr2 == 0)
return 100;
$distance = self::distance($str1, $str2);
$similarity = 100 - (int)round(200 * $distance / ($lenStr1 + $lenStr2));
return $similarity >= 100 ? 100 : $similarity;
}
// Case insensitive version of similarity()
public static function similarityi($str1, $str2)
{
return self::similarity(strtolower($str1), strtolower($str2));
}
}
I've implemented the pseudo-code given in
http://en.wikipedia.org/wiki/Damerau-Le ... n_distance in PHP.
Three major questions:
- do you think it can be implemented in a better/faster way?
- a better and more accurate version of similarity() function?
- I'm going to add 5-th type of miss typing - a transposition based on the keyboard layout (for now - QWERTY). Any suggestions or ideas about its implementation?
Thank you.
Re: Class - Damerau-Levenshtein distance
Posted: Mon Oct 06, 2008 6:50 pm
by Luke
How about the native levenshtein function?
http://us.php.net/levenshtein 
Re: Class - Damerau-Levenshtein distance
Posted: Mon Oct 06, 2008 7:23 pm
by VladSun
I know about it but Damerau-Levenshtein is superior to it

At least Wiki says so

Re: Class - Damerau-Levenshtein distance
Posted: Mon Oct 06, 2008 10:20 pm
by Christopher
Have you run a bunch of strings through your and the library function to see if it is superior?
Re: Class - Damerau-Levenshtein distance
Posted: Tue Oct 07, 2008 5:04 am
by VladSun
I mean it's superior in accuracy of the results:
DamerauLevenshtein::distance('people', 'pepole') = 1
levenshtein('people', 'pepole') = 2
Apparently, if I want to make it superior in speed, I'll have to write it as a PHP module or something like that

Re: Class - Damerau-Levenshtein distance
Posted: Tue Oct 28, 2008 10:56 pm
by seanie
If you were going to use your PHP application with a MySQL database, you could always try my MySQL User Defined Functions:
http://blog.lolyco.com/sean/2008/08/27/ ... positions/
You can still use them even if you're checking two non-stored strings with a SQL statement like:
select dam_lev(string1, string2);
I'd be interested to know how using the UDF in this way compares to your PHP coded function.
-edit- haha URL is laughing at me.
Re: Class - Damerau-Levenshtein distance
Posted: Wed Oct 29, 2008 6:24 am
by VladSun
Yes, I've used it mainly with DB.
My validation on a text field (corresponding to a text field in a DB table with unique constraint) is a two stage one - it checks whether the text entered by the user is similar enough to any other texts already entered in the DB. If it so, the form is redisplayed and a warning text is displayed (with the closest texts found). At this stage the user should check whether he has made any spell check errors and decide whether he is entering a new record indeed or it's not the case. On the second form submition with unchanged text, no such checks are performed and the record is inserted in the DB.
Re: Class - Damerau-Levenshtein distance
Posted: Mon Jan 18, 2010 9:15 am
by VladSun
VladSun wrote:Apparently, if I want to make it superior in speed, I'll have to write it as a PHP module or something like that


Guess what!
Code: Select all
echo damerau_levenshtein("monday", "mondya")
=> 1
To compile:
Code: Select all
phpize && ./configure --enable-damerau-levenshtein && make clean && make
Module will be placed in ./modules - copy it to your PHP extensions directory and enable it in php.ini
Re: Class - Damerau-Levenshtein distance
Posted: Mon Jan 18, 2010 9:27 am
by Weirdan
Source tarball seems to be broken - 81 bytes is a bit small for a tarball.
Re: Class - Damerau-Levenshtein distance
Posted: Mon Jan 18, 2010 9:34 am
by VladSun
Weirdan wrote:Source tarball seems to be broken - 81 bytes is a bit small for a tarball.
Thanks, fixed

Re: Class - Damerau-Levenshtein distance
Posted: Mon Jan 18, 2010 9:47 am
by Eran
That's pretty sweet! nice work, vlad

Re: Class - Damerau-Levenshtein distance
Posted: Mon Jan 18, 2010 9:55 am
by VladSun
pytrin wrote:That's pretty sweet! nice work, vlad

Thanks
Code: Select all
if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|l", &a, &alen, &b, &blen) == FAILURE)
RETURN_NULL();
An optional "length" parameter was intended to be used for the optimized algorithm - i.e. if the two input strings are too different (threshold is "length") then stop calculations. TODO

Re: Class - Damerau-Levenshtein distance
Posted: Mon Jan 18, 2010 10:01 am
by Weirdan
Few notes:
- You need not to check for emalloc result:
PHP at the Core: A Hacker's Guide to the Zend Engine wrote:
Note: Unlike their C standard library's counterparts the Zend Engine's memory management functions won't return NULL in case of an problem while allocating the requested memory but bail out and terminate the current request.
- You're leaking memory by not freeing dist array (efree). Granted, it would be cleared at the end of request thanks to emalloc, but for long-running scripts like daemons this garbage will accumulate.
- No tests?
Re: Class - Damerau-Levenshtein distance
Posted: Mon Jan 18, 2010 5:20 pm
by VladSun
Yeah, you are absolutely right

Shame on me !
Files updated.
PS: Anyone with some good articles on writing PHP extensions?
Re: Class - Damerau-Levenshtein distance
Posted: Mon Jan 18, 2010 8:33 pm
by Weirdan