There are tens of thousands of sequences. Many differ only slightly; others are identical except that two words are swapped, or an abbreviation was used instead of the full word, etc.
This leads to a great deal of duplication, and I would like to build a stored procedure (preferably in MySQL), perhaps implemented with an LCS (longest common subsequence) algorithm, to calculate a similarity percentage.
That is, if two sequences differ only by two swapped words, their similarity would be very high (95%, for example), which would be a good indicator of possible duplication. One of them could then be removed; this is the end goal.
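To make the measure concrete, here is a rough sketch of what I have in mind, written in Python rather than MySQL only because it is quicker to try out. The function names and the scoring formula (twice the LCS length divided by the combined word count) are my own assumptions, not an established standard, and comparison is done on lowercased, whitespace-separated words.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    # Classic dynamic programming, keeping only the previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_percent(s1, s2):
    """Word-level similarity in percent (assumed formula: 2 * LCS / total words)."""
    words1 = s1.lower().split()
    words2 = s2.lower().split()
    if not words1 and not words2:
        return 100.0  # two empty sequences are treated as identical
    return 100.0 * 2 * lcs_length(words1, words2) / (len(words1) + len(words2))


print(similarity_percent("the quick brown fox jumps",
                         "the brown quick fox jumps"))  # 80.0
```

With this formula, swapping two adjacent words costs exactly one word of the LCS, so a 5-word sequence scores 80% while a 20-word sequence scores 95%. Abbreviations would count as plain mismatches unless they are expanded first.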
I am familiar with LCS, having implemented it in a computer science course, but that was more than 10 years ago and I have since forgotten the details and lost interest in algorithm development.
Ideally there is a MySQL stored procedure somewhere that would calculate the similarity percentage for all pairs of sequences in the DB and return them ordered by similarity, descending. Is this possible using stored procedures?
Does anyone know of anything similar I could copy or borrow from a blog or elsewhere?
It's a long shot, I know; I just figured I would ask.
Cheers,
Alex