Page 1 of 2

Class - Damerau-Levenshtein distance

Posted: Mon Oct 06, 2008 3:19 pm
by VladSun

Code: Select all

class DamerauLevenshtein
{
    // Measures the Damerau-Levenshtein distance of two words
    public static function distance($str1, $str2)
    {
        $d = Array();
     
        $lenStr1 = strlen($str1);
        $lenStr2 = strlen($str2);
     
        if ($lenStr1 == 0) 
            return $lenStr2;
            
        if ($lenStr2 == 0) 
            return $lenStr1;
     
        for ($i=0; $i <= $lenStr1; $i++)
        {
            $d[$i] = Array();
            $d[$i][0] = $i;     
        }
 
        for ($j=0; $j <= $lenStr2; $j++)
            $d[0][$j] = $j; 
        
        for ($i=1; $i <= $lenStr1; $i++)
        {
            for ($j=1; $j <= $lenStr2; $j++)
            {
                $cost = substr($str1, $i - 1, 1) == substr($str2, $j - 1, 1) ? 0 : 1;
                
                $d[$i][$j] = min(
                                    $d[$i - 1][$j] + 1,                 // deletion
                                    $d[$i][$j - 1] + 1,                 // insertion
                                    $d[$i - 1][$j - 1] + $cost          // substitution
                                );
                                
                if (    
                    $i > 1 && 
                    $j > 1 && 
                    substr($str1, $i - 1, 1) == substr($str2, $j - 2, 1) && 
                    substr($str1, $i - 2, 1) == substr($str2, $j - 1, 1) 
                )
                   $d[$i][$j] = min(
                                    $d[$i][$j],
                                    $d[$i - 2][$j - 2] + $cost          // transposition
                                 );
            }
 
        }
        return $d[$lenStr1][$lenStr2];
    }
    
    // Case insensitive version of distance()
    public static function distancei($str1, $str2)
    {
        return self::distance(strtolower($str1), strtolower($str2));
    }
    
    // An attempt to measure word similarity in percent 
    public static function similarity($str1, $str2)
    {
        $lenStr1 = strlen($str1);
        $lenStr2 = strlen($str2);
     
        if ($lenStr1 == 0 && $lenStr2 == 0)
            return 100;
 
        $distance = self::distance($str1, $str2);
        $similarity = 100 - (int)round(200 * $distance / ($lenStr1 + $lenStr2));
        return $similarity >= 100 ? 100 : $similarity;
    }
 
    // Case insensitive version of similarity()
    public static function similarityi($str1, $str2)
    {
        return self::similarity(strtolower($str1), strtolower($str2));
    }
    
}
I've implemented the pseudo-code given in http://en.wikipedia.org/wiki/Damerau-Le ... n_distance in PHP.
Three major questions:
- do you think it can be implemented in a better/faster way?
- a better and more accurate version of similarity() function?
- I'm going to add 5-th type of miss typing - a transposition based on the keyboard layout (for now - QWERTY). Any suggestions or ideas about its implementation?

Thank you.

Re: Class - Damerau-Levenshtein distance

Posted: Mon Oct 06, 2008 6:50 pm
by Luke
How about the native levenshtein function? http://us.php.net/levenshtein :)

Re: Class - Damerau-Levenshtein distance

Posted: Mon Oct 06, 2008 7:23 pm
by VladSun
I know about it but Damerau-Levenshtein is superior to it ;) At least Wiki says so :) :)

Re: Class - Damerau-Levenshtein distance

Posted: Mon Oct 06, 2008 10:20 pm
by Christopher
Have you run a bunch of strings through your and the library function to see if it is superior?

Re: Class - Damerau-Levenshtein distance

Posted: Tue Oct 07, 2008 5:04 am
by VladSun
I mean it's superior in accuracy of the results:
DamerauLevenshtein::distance('people', 'pepole') = 1
levenshtein('people', 'pepole') = 2
Apparently, if I want to make it superior in speed, I'll have to write it as a PHP module or something like that :)

Re: Class - Damerau-Levenshtein distance

Posted: Tue Oct 28, 2008 10:56 pm
by seanie
If you were going to use your PHP application with a MySQL database, you could always try my MySQL User Defined Functions:
http://blog.lolyco.com/sean/2008/08/27/ ... positions/
You can still use them even if you're checking two non-stored strings with a SQL statement like:

select dam_lev(string1, string2);

I'd be interested to know how using the UDF in this way compares to your PHP coded function.

-edit- haha URL is laughing at me.

Re: Class - Damerau-Levenshtein distance

Posted: Wed Oct 29, 2008 6:24 am
by VladSun
Yes, I've used it mainly with DB.

My validation on a text field (corresponding to a text field in a DB table with unique constraint) is a two stage one - it checks whether the text entered by the user is similar enough to any other texts already entered in the DB. If it so, the form is redisplayed and a warning text is displayed (with the closest texts found). At this stage the user should check whether he has made any spell check errors and decide whether he is entering a new record indeed or it's not the case. On the second form submition with unchanged text, no such checks are performed and the record is inserted in the DB.

Re: Class - Damerau-Levenshtein distance

Posted: Mon Jan 18, 2010 9:15 am
by VladSun
VladSun wrote:Apparently, if I want to make it superior in speed, I'll have to write it as a PHP module or something like that :)
:) Guess what! :twisted:

Code: Select all

echo damerau_levenshtein("monday", "mondya")
=> 1
damerau_levenshtein.tar.gz
Source files
(17.45 KiB) Downloaded 807 times
damerau_levenshtein.so.tar.gz
Precompiled module (32bit)
(10.6 KiB) Downloaded 667 times
To compile:

Code: Select all

phpize && ./configure --enable-damerau-levenshtein && make clean && make
Module will be placed in ./modules - copy it to your PHP extensions directory and enable it in php.ini

Re: Class - Damerau-Levenshtein distance

Posted: Mon Jan 18, 2010 9:27 am
by Weirdan
Source tarball seems to be broken - 81 bytes is a bit small for a tarball.

Re: Class - Damerau-Levenshtein distance

Posted: Mon Jan 18, 2010 9:34 am
by VladSun
Weirdan wrote:Source tarball seems to be broken - 81 bytes is a bit small for a tarball.
Thanks, fixed :)

Re: Class - Damerau-Levenshtein distance

Posted: Mon Jan 18, 2010 9:47 am
by Eran
That's pretty sweet! nice work, vlad :P

Re: Class - Damerau-Levenshtein distance

Posted: Mon Jan 18, 2010 9:55 am
by VladSun
pytrin wrote:That's pretty sweet! nice work, vlad :P
Thanks :)

Code: Select all

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|l", &a, &alen, &b, &blen) == FAILURE) 
        RETURN_NULL();
 
An optional "length" parameter was intended to be used for the optimized algorithm - i.e. if the two input strings are too different (threshold is "length") then stop calculations. TODO ;)

Re: Class - Damerau-Levenshtein distance

Posted: Mon Jan 18, 2010 10:01 am
by Weirdan
Few notes:
  • You need not to check for emalloc result:
    PHP at the Core: A Hacker's Guide to the Zend Engine wrote: Note: Unlike their C standard library's counterparts the Zend Engine's memory management functions won't return NULL in case of an problem while allocating the requested memory but bail out and terminate the current request.
  • You're leaking memory by not freeing dist array (efree). Granted, it would be cleared at the end of request thanks to emalloc, but for long-running scripts like daemons this garbage will accumulate.
  • No tests?

Re: Class - Damerau-Levenshtein distance

Posted: Mon Jan 18, 2010 5:20 pm
by VladSun
Yeah, you are absolutely right :)
Shame on me ! :)

Files updated.

PS: Anyone with some good articles on writing PHP extensions?

Re: Class - Damerau-Levenshtein distance

Posted: Mon Jan 18, 2010 8:33 pm
by Weirdan
VladSun wrote:PS: Anyone with some good articles on writing PHP extensions?
Here are those I remember:
http://books.google.com/books?id=zMbGvK ... q=&f=false
http://blog.libssh2.org/index.php?/categories/1-PHP