LCS algorithm in MySQL stored proc? Or MSSQL?

Questions about the MySQL, PostgreSQL, and most other databases, as well as using it with PHP can be asked here.

Moderator: General Moderators

Post Reply
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

LCS algorithm in MySQL stored proc? Or MSSQL?

Post by alex.barylski »

I have a database table ironically called sequences. Each sequence is a description of a work which must be carried out at a given maintenance station.

There are tens of thousands of sequences, many only variant in a small way, others are identical but have two words swapped or abbreviations were used instead of full word, etc.

This leads to a great deal of duplication and I would like to build a stored proc (preferably in MySQL) perhaps implemented as LCS algorithm, to calculate a % of similarities.

That is, if two sequences are only off by two words being swapped, their similar would be very high (95% for example) and that would be a good indicator as to possible duplication. One of them could be removed, this is the end goal.

I am familiar with LCS, having implemented it years ago in computer science course, however that was more than 10 years ago and I have since forgotten and lost interest in algorithm development. :P

Ideally there exists a MySQL stored proc somewhere which would calculate the % of similarities of all sequences in the DB and return them ordered by the similarity descendning. Is this possible using stored procs?

Does anyone know of anything similar I could copy or borrow from a blog or something similar/??

Long shot asking this I know just figiured I would ask.

Cheers,
Alex
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Post by Eran »

I think you will be better served by using an indexing solution as an intermediary access layer to the database than trying to implement something like this yourself. Something like sphinx, which has great integration with MySQL and very strong performance for fuzzy / similar search.
http://www.sphinxsearch.com/
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Post by alex.barylski »

Performance is not really an issue, although I may be speaking prematurely if we build this functionality into our primary listing service. For now I just want something admin's can call to determine whether at two sequences are similar enough to merge into a single record.

I'm not sure full text matching is what I need, as I am not searching for anything in particular, rather a comparison of all records in the table determining which description fields are most similar:
REMOVE ALL SHARP EDGES, CLEAN AND DEBUR.
REMOVE ALL SHARP EDGES, DEBUR AND CLEAN.
Two identical sequences but they will not show on a simple DISTINCT(description) query because they are not identical. This is common in our database due to the many people constantly adding new sequences without thouroughly searching for something similar first.

An LCS algorithm (or derivative of) ideally would list the results in such a way so as to show similar matching results in tandem. So more closely related matches would show first based on possible fulltext matching, field comparison, etc. Another reason I wanted to stick with a SQL solution because I can implementing all that by hand being a very time consuming process.

Cheers,
Alex
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Post by Eran »

Aren't you searching the records by some given string? how are you filtering the results?
User avatar
AbraCadaver
DevNet Master
Posts: 2572
Joined: Mon Feb 24, 2003 10:12 am
Location: The Republic of Texas
Contact:

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Post by AbraCadaver »

This might be helpful. I would use the LEVENSHTEIN_RATIO() and return only matches above a certain percent and order by that descending:

Code: Select all

DELIMITER //
 
CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
    DETERMINISTIC
      BEGIN
        DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
        DECLARE s1_char CHAR;
        DECLARE cv0, cv1 VARBINARY(256);
        SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
        IF s1 = s2 THEN
          RETURN 0;
        ELSEIF s1_len = 0 THEN
          RETURN s2_len;
        ELSEIF s2_len = 0 THEN
          RETURN s1_len;
        ELSE
          WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
          END WHILE;
          WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
          END WHILE;
        END IF;
        RETURN c;
      END
//
 
CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
    DETERMINISTIC
      BEGIN
        DECLARE s1_len, s2_len, max_len INT;
        SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
        IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
        RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
      END
//
From here: http://codejanitor.com/wp/2007/02/10/le ... -function/

If you wanted to use PHP, there is a levenshtein() function as well as similar_text().
mysql_function(): WARNING: This extension is deprecated as of PHP 5.5.0, and will be removed in the future. Instead, the MySQLi or PDO_MySQLextension should be used. See also MySQL: choosing an API guide and related FAQ for more information.
User avatar
AbraCadaver
DevNet Master
Posts: 2572
Joined: Mon Feb 24, 2003 10:12 am
Location: The Republic of Texas
Contact:

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Post by AbraCadaver »

After thinking on it though, the full text matching can rank them as well. The following aren't that much different conceptually, though I haven't tested:

Code: Select all

SELECT db_column, MATCH (db_column) AGAINST ('$php_var') AS rank
  FROM db_table
  WHERE MATCH (db_column) AGAINST ('$php_var');
 
SELECT db_column, LEVENSHTEIN_RATIO('$php_var', db_column) AS rank
  FROM db_table
  ORDER BY rank DESC;
Throw an additional WHERE condition in either one to get only matches with rank greater than some number.
mysql_function(): WARNING: This extension is deprecated as of PHP 5.5.0, and will be removed in the future. Instead, the MySQLi or PDO_MySQLextension should be used. See also MySQL: choosing an API guide and related FAQ for more information.
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Post by alex.barylski »

The only problem with this, is there is no $php_var

I have a table of 20000+ sequences...many of those (relatively speaking) are very similar in description text but off just enough to disable a DISTINCT query in attempt to find duplicates to merge.

That is essentially what I am looking to do. Build a DISTINCT query but not 100% DISTINCT maybe only 85% or less or more, depending on what the admin looking for dupes what to set the value to.

So maybe I am not looking to FULLTEXT, cause that would require a search query, whereas that is not my requirement.

It's basically to assist admin's in finding potential duplicates, I cannot have them enter key searches because that wouldnt help them find duplicates. However just as you could run DISTINCT on a table to return non-duplicates I was hoping to achieve something similar but the reverse, with a % of duplication or something and somehow show the other records which are similar in nature.

I could write something custom in PHP using levenshtein method I suppose. Iterate over every single record using it's own description text to find similar records, with highest rankings appearing at the top of the list. In fact that is probably the better solution now that I better understand myself, what is actually needed here. Maybe a list of 'most likely duplicates' or something :P

Thanks :)
Cheers,
Alex
User avatar
AbraCadaver
DevNet Master
Posts: 2572
Joined: Mon Feb 24, 2003 10:12 am
Location: The Republic of Texas
Contact:

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Post by AbraCadaver »

OK. I saw two possible scenarios for this, in each you would have a $php_var:

1. Use to help in cleaning up the DB. Admin gets a page with a list of sequences, each one with a "show similar" link, when clicked shows the top ranking similar sequences, each with a delete link next to it.

2. For user entry to keep it somewhat normalized. If a user tries to add a sequence, you can display a list of similar ones and say, "We found these similar sequences already in the system, did you mean one of these?"
mysql_function(): WARNING: This extension is deprecated as of PHP 5.5.0, and will be removed in the future. Instead, the MySQLi or PDO_MySQLextension should be used. See also MySQL: choosing an API guide and related FAQ for more information.
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Post by alex.barylski »

The former is what I was thinking initially, the latter is what I started to think would be nice to have as well, yes. :)
User avatar
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Post by Weirdan »

PCSpectra wrote: That is essentially what I am looking to do. Build a DISTINCT query but not 100% DISTINCT maybe only 85% or less or more, depending on what the admin looking for dupes what to set the value to.
So basically you're looking for query like this:
[sql] SELECT `primary`.`title`, count(*) AS `likely_duplicates`FROM `sequences` `primary`LEFT JOIN `sequences` `other`ON similarity_function(`primary`.`title`, `other`.`title`) > 95GROUP BY `primary`.`id`ORDER BY `likely_duplicates` DESC [/sql]
where you can substitute similarity_function with some string comparison function like levenstein_ratio.

This will be quite expensive query, but it fits your description.
User avatar
AbraCadaver
DevNet Master
Posts: 2572
Joined: Mon Feb 24, 2003 10:12 am
Location: The Republic of Texas
Contact:

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Post by AbraCadaver »

Yes, for #1 it should be an infrequent upfront maintenance activity, probably not very expensive. An admin loads up the page looks through stuff, deletes it etc.
mysql_function(): WARNING: This extension is deprecated as of PHP 5.5.0, and will be removed in the future. Instead, the MySQLi or PDO_MySQLextension should be used. See also MySQL: choosing an API guide and related FAQ for more information.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Post by VladSun »

Using the Weirdan's and AbraCadaver's suggestions you can try this:
[sql]SELECT     `primary`.`title`,     MATCH (`primary`.`title`) AGAINST (`other`.`title`) AS `relevance`FROM     `sequences` `primary`INNER JOIN     `sequences` AS `other`ON     `primary`.`id` <> `other`.`id`    AND     MATCH (`primary`.`title`) AGAINST (`other`.`title`) > 0ORDER BY     `primary`.`id`,    `relevance` DESC[/sql]
There are 10 types of people in this world, those who understand binary and those who don't
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Post by alex.barylski »

This will be quite expensive query, but it fits your description
I figured it would be insane, due to the fact the text fields are well populated and every description has to essentially test itself against all others and determine a percentage of similarity.

Cheers,
Alex
Post Reply