Compare paragraphs
Posted: Thu Jun 09, 2005 2:05 pm
by hawleyjr
I'm assuming this belongs in the code forum but regex may be the correct solution.
I'm creating a tool for comparing and editing copy. For instance, I have the following two sentences (in real use there would be much more copy than two simple sentences):
"Susan walked down the road to get her dog. Her dog ran away."
In the next entry the user can modify the text. However, I would like to display the user's edits in red.
So if the next user changed "road to get her dog" to "trail to get her hat"
Then
"Susan walked down the trail to get her hat. Her was not found."
Posted: Thu Jun 09, 2005 4:49 pm
by pickle
The difficulty here (which you may already have thought of) is how to sync up matching words after some difference. By that I mean:
"Susan walked down the road to get her dog"
"Susan ran the road to get her dog"
Obviously "Susan" and "the road to get her dog" are the same in both sentences, but the first sentence has more words between the two phrases.
There are two things I can think of:
1) Use strtok() to split both strings into words, and compare the tokens
2) Take the original and modified strings, put them both in files with 1 word on each line, and call the Linux command diff on the two files. Parsing the output could get you what you need.
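To make the word-by-word idea concrete, here is a minimal sketch. It is in Python rather than PHP, and it uses Python's standard difflib module as a stand-in for the strtok()/diff approaches above, so treat it as an illustration of the technique, not a drop-in solution. The function name highlight_edits and the red span markup are my own assumptions.

```python
import difflib
import html


def highlight_edits(original, modified):
    """Return `modified` as HTML with inserted or replaced words in red.

    Assumption: words are separated by whitespace, so punctuation stays
    attached to its word ("dog." and "dog" count as different words).
    """
    old_words = original.split()
    new_words = modified.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)

    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        chunk = html.escape(" ".join(new_words[j1:j2]))
        if not chunk:
            continue  # pure deletion: nothing from the new text to show
        if tag == "equal":
            out.append(chunk)
        else:
            # "replace" or "insert": these words came from the editor
            out.append('<span style="color:red">' + chunk + "</span>")
    return " ".join(out)


if __name__ == "__main__":
    print(highlight_edits(
        "Susan walked down the road to get her dog.",
        "Susan walked down the trail to get her hat.",
    ))
```

Deleted words are simply dropped here; if you want to show them struck through, the old_words[i1:i2] slice holds them for the "delete" and "replace" cases.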
Posted: Thu Jun 09, 2005 8:56 pm
by Ambush Commander
MediaWiki implements this, and it is written in PHP. It's a fairly professional project, so you might want to take a look at its source code. I believe the file is includes/DifferenceEngine.php and the class is _DiffEngine. Here is the comment right before the class:
/**
* Class used internally by Diff to actually compute the diffs.
*
* The algorithm used here is mostly lifted from the perl module
* Algorithm::Diff (version 1.06) by Ned Konz, which is available at:
* http://www.perl.com/CPAN/authors/id/N/N ... f-1.06.zip
*
* More ideas are taken from:
* http://www.ics.uci.edu/~eppstein/161/960229.html
*
* Some ideas are (and a bit of code) are from from analyze.c, from GNU
* diffutils-2.7, which can be found at:
* ftp://gnudist.gnu.org/pub/gnu/diffutils ... 2.7.tar.gz
*
* Finally, some ideas (subdivision by NCHUNKS > 2, and some optimizations)
* are my own.
*
* @author Geoffrey T. Dairiki
* @access private
*/
That comment gives you some good URLs to start from when you go looking for related algorithms.
And of course, the code is licensed under the GNU GPL, so if you're using a compatible license, feel free to use it.
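For a feel of what those algorithms do: as far as I know, Algorithm::Diff and the other sources cited in that comment are built around the longest common subsequence (LCS) problem. Below is a textbook dynamic-programming sketch in Python. It is not MediaWiki's implementation, which is considerably more optimized, and the function name lcs_words is my own.

```python
def lcs_words(a, b):
    """Return one longest common subsequence of two word lists."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to recover the words.
    common = []
    i = j = 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # skipping a word of `a` loses nothing
        else:
            j += 1  # skipping a word of `b` loses nothing
    return common


if __name__ == "__main__":
    old = "Susan walked down the road to get her dog".split()
    new = "Susan ran the road to get her dog".split()
    print(lcs_words(old, new))
```

Every word of the new text that is not in the common subsequence is an edit, which is exactly what would be shown in red. Note that the table needs len(a) * len(b) cells, so this plain version only suits fairly short texts.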
Posted: Fri Jun 10, 2005 2:30 am
by Syranide
A word of caution when comparing texts: exact (or even just good) comparisons are computationally expensive, so per-character or per-word comparison can be quite a resource hog on larger files. Monitor it, and make sure users can't submit arbitrarily large texts, since they could bring your system to its knees if they wanted to.
(Such a file could be a large PHP source file, I guess; PHP itself isn't very fast, and comparisons like these are very slow.)
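One simple way to guard against that is to check the input size before running any diff at all. A rough Python sketch follows; the limit MAX_WORDS and the function name check_diff_size are hypothetical, and the right cap depends entirely on your server and diff algorithm.

```python
MAX_WORDS = 5000  # hypothetical cap; tune it for your own server


def check_diff_size(original, modified, max_words=MAX_WORDS):
    """Raise ValueError if either text is too long to diff safely.

    Returns the product of the two word counts, a rough upper bound on
    the work done by a naive quadratic word-level diff.
    """
    old_count = len(original.split())
    new_count = len(modified.split())
    for label, count in (("original", old_count), ("modified", new_count)):
        if count > max_words:
            raise ValueError(
                f"{label} text has {count} words; limit is {max_words}"
            )
    return old_count * new_count
```

Call it before diffing and show the user an error (or fall back to a coarser line-level diff) when it raises.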
Posted: Fri Jun 10, 2005 3:27 pm
by Ambush Commander
Hmm... I think MediaWiki handles the diff comparisons fairly well (after all, it has to compute diffs almost constantly).