Class - Damerau-Levenshtein distance

Coding Critique is the place to post source code for peer review by other members of DevNetwork. Any kind of code can be posted. Code posted does not have to be limited to PHP. All members are invited to contribute constructive criticism with the goal of improving the code. Posted code should include some background information about it and what areas you specifically would like help with.

Popular code excerpts may be moved to "Code Snippets" by the moderators.

Moderator: General Moderators

User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Class - Damerau-Levenshtein distance

Post by VladSun »

Code: Select all

class DamerauLevenshtein
{
    // Measures the Damerau-Levenshtein distance of two words
    public static function distance($str1, $str2)
    {
        $d = Array();
     
        $lenStr1 = strlen($str1);
        $lenStr2 = strlen($str2);
     
        if ($lenStr1 == 0) 
            return $lenStr2;
            
        if ($lenStr2 == 0) 
            return $lenStr1;
     
        for ($i=0; $i <= $lenStr1; $i++)
        {
            $d[$i] = Array();
            $d[$i][0] = $i;     
        }
 
        for ($j=0; $j <= $lenStr2; $j++)
            $d[0][$j] = $j; 
        
        for ($i=1; $i <= $lenStr1; $i++)
        {
            for ($j=1; $j <= $lenStr2; $j++)
            {
                $cost = substr($str1, $i - 1, 1) == substr($str2, $j - 1, 1) ? 0 : 1;
                
                $d[$i][$j] = min(
                                    $d[$i - 1][$j] + 1,                 // deletion
                                    $d[$i][$j - 1] + 1,                 // insertion
                                    $d[$i - 1][$j - 1] + $cost          // substitution
                                );
                                
                if (    
                    $i > 1 && 
                    $j > 1 && 
                    substr($str1, $i - 1, 1) == substr($str2, $j - 2, 1) && 
                    substr($str1, $i - 2, 1) == substr($str2, $j - 1, 1) 
                )
                   $d[$i][$j] = min(
                                    $d[$i][$j],
                                    $d[$i - 2][$j - 2] + $cost          // transposition
                                 );
            }
 
        }
        return $d[$lenStr1][$lenStr2];
    }
    
    // Case insensitive version of distance()
    public static function distancei($str1, $str2)
    {
        return self::distance(strtolower($str1), strtolower($str2));
    }
    
    // An attempt to measure word similarity in percent 
    public static function similarity($str1, $str2)
    {
        $lenStr1 = strlen($str1);
        $lenStr2 = strlen($str2);
     
        if ($lenStr1 == 0 && $lenStr2 == 0)
            return 100;
 
        $distance = self::distance($str1, $str2);
        $similarity = 100 - (int)round(200 * $distance / ($lenStr1 + $lenStr2));
        return $similarity >= 100 ? 100 : $similarity;
    }
 
    // Case insensitive version of similarity()
    public static function similarityi($str1, $str2)
    {
        return self::similarity(strtolower($str1), strtolower($str2));
    }
    
}
I've implemented the pseudo-code given in http://en.wikipedia.org/wiki/Damerau-Le ... n_distance in PHP.
Three major questions:
- do you think it can be implemented in a better/faster way?
- a better and more accurate version of similarity() function?
- I'm going to add 5-th type of miss typing - a transposition based on the keyboard layout (for now - QWERTY). Any suggestions or ideas about its implementation?

Thank you.
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Re: Class - Damerau-Levenshtein distance

Post by Luke »

How about the native levenshtein function? http://us.php.net/levenshtein :)
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Class - Damerau-Levenshtein distance

Post by VladSun »

I know about it but Damerau-Levenshtein is superior to it ;) At least Wiki says so :) :)
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Class - Damerau-Levenshtein distance

Post by Christopher »

Have you run a bunch of strings through your and the library function to see if it is superior?
(#10850)
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Class - Damerau-Levenshtein distance

Post by VladSun »

I mean it's superior in accuracy of the results:
DamerauLevenshtein::distance('people', 'pepole') = 1
levenshtein('people', 'pepole') = 2
Apparently, if I want to make it superior in speed, I'll have to write it as a PHP module or something like that :)
There are 10 types of people in this world, those who understand binary and those who don't
seanie
Forum Newbie
Posts: 1
Joined: Tue Oct 28, 2008 10:48 pm

Re: Class - Damerau-Levenshtein distance

Post by seanie »

If you were going to use your PHP application with a MySQL database, you could always try my MySQL User Defined Functions:
http://blog.lolyco.com/sean/2008/08/27/ ... positions/
You can still use them even if you're checking two non-stored strings with a SQL statement like:

select dam_lev(string1, string2);

I'd be interested to know how using the UDF in this way compares to your PHP coded function.

-edit- haha URL is laughing at me.
Last edited by Benjamin on Thu Feb 25, 2010 12:10 am, edited 1 time in total.
Reason: Fixed URL
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Class - Damerau-Levenshtein distance

Post by VladSun »

Yes, I've used it mainly with DB.

My validation on a text field (corresponding to a text field in a DB table with unique constraint) is a two stage one - it checks whether the text entered by the user is similar enough to any other texts already entered in the DB. If it so, the form is redisplayed and a warning text is displayed (with the closest texts found). At this stage the user should check whether he has made any spell check errors and decide whether he is entering a new record indeed or it's not the case. On the second form submition with unchanged text, no such checks are performed and the record is inserted in the DB.
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Class - Damerau-Levenshtein distance

Post by VladSun »

VladSun wrote:Apparently, if I want to make it superior in speed, I'll have to write it as a PHP module or something like that :)
:) Guess what! :twisted:

Code: Select all

echo damerau_levenshtein("monday", "mondya")
=> 1
damerau_levenshtein.tar.gz
Source files
(17.45 KiB) Downloaded 807 times
damerau_levenshtein.so.tar.gz
Precompiled module (32bit)
(10.6 KiB) Downloaded 667 times
To compile:

Code: Select all

phpize && ./configure --enable-damerau-levenshtein && make clean && make
Module will be placed in ./modules - copy it to your PHP extensions directory and enable it in php.ini
Last edited by VladSun on Tue Mar 08, 2011 8:58 am, edited 7 times in total.
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: Class - Damerau-Levenshtein distance

Post by Weirdan »

Source tarball seems to be broken - 81 bytes is a bit small for a tarball.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Class - Damerau-Levenshtein distance

Post by VladSun »

Weirdan wrote:Source tarball seems to be broken - 81 bytes is a bit small for a tarball.
Thanks, fixed :)
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Class - Damerau-Levenshtein distance

Post by Eran »

That's pretty sweet! nice work, vlad :P
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Class - Damerau-Levenshtein distance

Post by VladSun »

pytrin wrote:That's pretty sweet! nice work, vlad :P
Thanks :)

Code: Select all

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|l", &a, &alen, &b, &blen) == FAILURE) 
        RETURN_NULL();
 
An optional "length" parameter was intended to be used for the optimized algorithm - i.e. if the two input strings are too different (threshold is "length") then stop calculations. TODO ;)
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: Class - Damerau-Levenshtein distance

Post by Weirdan »

Few notes:
  • You need not to check for emalloc result:
    PHP at the Core: A Hacker's Guide to the Zend Engine wrote: Note: Unlike their C standard library's counterparts the Zend Engine's memory management functions won't return NULL in case of an problem while allocating the requested memory but bail out and terminate the current request.
  • You're leaking memory by not freeing dist array (efree). Granted, it would be cleared at the end of request thanks to emalloc, but for long-running scripts like daemons this garbage will accumulate.
  • No tests?
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Class - Damerau-Levenshtein distance

Post by VladSun »

Yeah, you are absolutely right :)
Shame on me ! :)

Files updated.

PS: Anyone with some good articles on writing PHP extensions?
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: Class - Damerau-Levenshtein distance

Post by Weirdan »

VladSun wrote:PS: Anyone with some good articles on writing PHP extensions?
Here are those I remember:
http://books.google.com/books?id=zMbGvK ... q=&f=false
http://blog.libssh2.org/index.php?/categories/1-PHP
Post Reply