Page 1 of 1

LCS algorithm in MySQL stored proc? Or MSSQL?

Posted: Mon Feb 22, 2010 8:50 am
by alex.barylski
I have a database table ironically called sequences. Each sequence is a description of a work which must be carried out at a given maintenance station.

There are tens of thousands of sequences, many only variant in a small way, others are identical but have two words swapped or abbreviations were used instead of full word, etc.

This leads to a great deal of duplication and I would like to build a stored proc (preferably in MySQL) perhaps implemented as LCS algorithm, to calculate a % of similarities.

That is, if two sequences are only off by two words being swapped, their similar would be very high (95% for example) and that would be a good indicator as to possible duplication. One of them could be removed, this is the end goal.

I am familiar with LCS, having implemented it years ago in computer science course, however that was more than 10 years ago and I have since forgotten and lost interest in algorithm development. :P

Ideally there exists a MySQL stored proc somewhere which would calculate the % of similarities of all sequences in the DB and return them ordered by the similarity descendning. Is this possible using stored procs?

Does anyone know of anything similar I could copy or borrow from a blog or something similar/??

Long shot asking this I know just figiured I would ask.

Cheers,
Alex

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Posted: Mon Feb 22, 2010 8:54 am
by Eran
I think you will be better served by using an indexing solution as an intermediary access layer to the database than trying to implement something like this yourself. Something like sphinx, which has great integration with MySQL and very strong performance for fuzzy / similar search.
http://www.sphinxsearch.com/

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Posted: Mon Feb 22, 2010 10:43 am
by alex.barylski
Performance is not really an issue, although I may be speaking prematurely if we build this functionality into our primary listing service. For now I just want something admin's can call to determine whether at two sequences are similar enough to merge into a single record.

I'm not sure full text matching is what I need, as I am not searching for anything in particular, rather a comparison of all records in the table determining which description fields are most similar:
REMOVE ALL SHARP EDGES, CLEAN AND DEBUR.
REMOVE ALL SHARP EDGES, DEBUR AND CLEAN.
Two identical sequences but they will not show on a simple DISTINCT(description) query because they are not identical. This is common in our database due to the many people constantly adding new sequences without thouroughly searching for something similar first.

An LCS algorithm (or derivative of) ideally would list the results in such a way so as to show similar matching results in tandem. So more closely related matches would show first based on possible fulltext matching, field comparison, etc. Another reason I wanted to stick with a SQL solution because I can implementing all that by hand being a very time consuming process.

Cheers,
Alex

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Posted: Mon Feb 22, 2010 11:20 am
by Eran
Aren't you searching the records by some given string? how are you filtering the results?

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Posted: Mon Feb 22, 2010 12:52 pm
by AbraCadaver
This might be helpful. I would use the LEVENSHTEIN_RATIO() and return only matches above a certain percent and order by that descending:

Code: Select all

DELIMITER //
 
CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
    DETERMINISTIC
      BEGIN
        DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
        DECLARE s1_char CHAR;
        DECLARE cv0, cv1 VARBINARY(256);
        SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
        IF s1 = s2 THEN
          RETURN 0;
        ELSEIF s1_len = 0 THEN
          RETURN s2_len;
        ELSEIF s2_len = 0 THEN
          RETURN s1_len;
        ELSE
          WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
          END WHILE;
          WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
          END WHILE;
        END IF;
        RETURN c;
      END
//
 
CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
    DETERMINISTIC
      BEGIN
        DECLARE s1_len, s2_len, max_len INT;
        SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
        IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
        RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
      END
//
From here: http://codejanitor.com/wp/2007/02/10/le ... -function/

If you wanted to use PHP, there is a levenshtein() function as well as similar_text().

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Posted: Mon Feb 22, 2010 1:32 pm
by AbraCadaver
After thinking on it though, the full text matching can rank them as well. The following aren't that much different conceptually, though I haven't tested:

Code: Select all

SELECT db_column, MATCH (db_column) AGAINST ('$php_var') AS rank
  FROM db_table
  WHERE MATCH (db_column) AGAINST ('$php_var');
 
SELECT db_column, LEVENSHTEIN_RATIO('$php_var', db_column) AS rank
  FROM db_table
  ORDER BY rank DESC;
Throw an additional WHERE condition in either one to get only matches with rank greater than some number.

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Posted: Mon Feb 22, 2010 2:30 pm
by alex.barylski
The only problem with this, is there is no $php_var

I have a table of 20000+ sequences...many of those (relatively speaking) are very similar in description text but off just enough to disable a DISTINCT query in attempt to find duplicates to merge.

That is essentially what I am looking to do. Build a DISTINCT query but not 100% DISTINCT maybe only 85% or less or more, depending on what the admin looking for dupes what to set the value to.

So maybe I am not looking to FULLTEXT, cause that would require a search query, whereas that is not my requirement.

It's basically to assist admin's in finding potential duplicates, I cannot have them enter key searches because that wouldnt help them find duplicates. However just as you could run DISTINCT on a table to return non-duplicates I was hoping to achieve something similar but the reverse, with a % of duplication or something and somehow show the other records which are similar in nature.

I could write something custom in PHP using levenshtein method I suppose. Iterate over every single record using it's own description text to find similar records, with highest rankings appearing at the top of the list. In fact that is probably the better solution now that I better understand myself, what is actually needed here. Maybe a list of 'most likely duplicates' or something :P

Thanks :)
Cheers,
Alex

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Posted: Mon Feb 22, 2010 2:45 pm
by AbraCadaver
OK. I saw two possible scenarios for this, in each you would have a $php_var:

1. Use to help in cleaning up the DB. Admin gets a page with a list of sequences, each one with a "show similar" link, when clicked shows the top ranking similar sequences, each with a delete link next to it.

2. For user entry to keep it somewhat normalized. If a user tries to add a sequence, you can display a list of similar ones and say, "We found these similar sequences already in the system, did you mean one of these?"

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Posted: Mon Feb 22, 2010 8:09 pm
by alex.barylski
The former is what I was thinking initially, the latter is what I started to think would be nice to have as well, yes. :)

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Posted: Mon Feb 22, 2010 8:54 pm
by Weirdan
PCSpectra wrote: That is essentially what I am looking to do. Build a DISTINCT query but not 100% DISTINCT maybe only 85% or less or more, depending on what the admin looking for dupes what to set the value to.
So basically you're looking for query like this:
[sql] SELECT `primary`.`title`, count(*) AS `likely_duplicates`FROM `sequences` `primary`LEFT JOIN `sequences` `other`ON similarity_function(`primary`.`title`, `other`.`title`) > 95GROUP BY `primary`.`id`ORDER BY `likely_duplicates` DESC [/sql]
where you can substitute similarity_function with some string comparison function like levenstein_ratio.

This will be quite expensive query, but it fits your description.

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Posted: Mon Feb 22, 2010 9:12 pm
by AbraCadaver
Yes, for #1 it should be an infrequent upfront maintenance activity, probably not very expensive. An admin loads up the page looks through stuff, deletes it etc.

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Posted: Tue Feb 23, 2010 6:29 am
by VladSun
Using the Weirdan's and AbraCadaver's suggestions you can try this:
[sql]SELECT     `primary`.`title`,     MATCH (`primary`.`title`) AGAINST (`other`.`title`) AS `relevance`FROM     `sequences` `primary`INNER JOIN     `sequences` AS `other`ON     `primary`.`id` <> `other`.`id`    AND     MATCH (`primary`.`title`) AGAINST (`other`.`title`) > 0ORDER BY     `primary`.`id`,    `relevance` DESC[/sql]

Re: LCS algorithm in MySQL stored proc? Or MSSQL?

Posted: Tue Feb 23, 2010 9:05 am
by alex.barylski
This will be quite expensive query, but it fits your description
I figured it would be insane, due to the fact the text fields are well populated and every description has to essentially test itself against all others and determine a percentage of similarity.

Cheers,
Alex