Compare paragraphs

hawleyjr
BeerMod
Posts: 2170
Joined: Tue Jan 13, 2004 4:58 pm
Location: Jax FL & Spokane WA USA

Post by hawleyjr »

I'm assuming this belongs in the code forum but regex may be the correct solution.

I'm creating a tool for comparing and editing copy. For instance, I have the following two sentences (in real use there would be much more copy than two simple sentences):

"Susan walked down the road to get her dog. Her dog ran away."

In the next entry the user can modify the text. However, I would like to display the user's edits in red.

So if the next user changed "road to get her dog" to "trail to get her hat", then the text would read as follows, with the changed words shown in red:

"Susan walked down the trail to get her hat. Her was not found."
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Post by pickle »

The difficulty here (which you may already have thought of) is how to sync up matching words after some difference. By that I mean:

"Susan walked down the road to get her dog"
"Susan ran the road to get her dog"

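Obviously "Susan" and "the road to get her dog" are the same in both sentences, but the first sentence has more words between the two phrases ("walked down" versus "ran"), so a position-by-position comparison falls out of step after the first difference.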

There are two things I can think of:
1) Use strtok() to tokenize both strings, then compare the tokens.
2) Take the original and modified strings, write each to a file with one word per line, and run the Linux diff command on the two files. Parsing the output could get you what you need.
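
For what it's worth, option 1 can be sketched in plain PHP: split both strings into words, then build a longest-common-subsequence (LCS) table to decide which words were kept, removed, or added. This is only a minimal sketch under my own assumptions, not what the diff command does internally. The function names (tokenize_words, word_diff, render_diff) and the inline red style are made up for illustration, and punctuation stays attached to the word it follows.

```php
<?php
// Hypothetical helper: split text into whitespace-separated word tokens.
function tokenize_words(string $text): array
{
    return preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Word-level diff via an LCS table.
// Returns a list of [op, word] pairs where op is '=', '-' or '+'.
function word_diff(array $old, array $new): array
{
    $n = count($old);
    $m = count($new);

    // $lcs[$i][$j] = LCS length of $old[$i..] and $new[$j..]
    $lcs = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $lcs[$i][$j] = ($old[$i] === $new[$j])
                ? $lcs[$i + 1][$j + 1] + 1
                : max($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
        }
    }

    // Walk the table from the start to emit the edit script.
    $ops = [];
    $i = 0;
    $j = 0;
    while ($i < $n && $j < $m) {
        if ($old[$i] === $new[$j]) {
            $ops[] = ['=', $old[$i]];
            $i++;
            $j++;
        } elseif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) {
            $ops[] = ['-', $old[$i++]];
        } else {
            $ops[] = ['+', $new[$j++]];
        }
    }
    while ($i < $n) { $ops[] = ['-', $old[$i++]]; }
    while ($j < $m) { $ops[] = ['+', $new[$j++]]; }

    return $ops;
}

// Render the new text as HTML, wrapping added words in a red span.
// Removed words are dropped here; they could be shown struck through instead.
function render_diff(array $ops): string
{
    $out = [];
    foreach ($ops as [$op, $word]) {
        $safe = htmlspecialchars($word, ENT_QUOTES, 'UTF-8');
        if ($op === '+') {
            $out[] = '<span style="color:red">' . $safe . '</span>';
        } elseif ($op === '=') {
            $out[] = $safe;
        }
    }
    return implode(' ', $out);
}
```

With the two sentences above, word_diff() marks "walked" and "down" as removed and "ran" as added, and render_diff() wraps only "ran" in the red span.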
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

MediaWiki implements this, and it is written in PHP. It's a fairly professional project, so you might want to take a look at its source code. I believe the file is includes/DifferenceEngine.php and the class is _DiffEngine. Here is the comment right before the class:

Code: Select all

/**
 * Class used internally by Diff to actually compute the diffs.
 *
 * The algorithm used here is mostly lifted from the perl module
 * Algorithm::Diff (version 1.06) by Ned Konz, which is available at:
 *	 http://www.perl.com/CPAN/authors/id/N/N ... f-1.06.zip
 *
 * More ideas are taken from:
 *	 http://www.ics.uci.edu/~eppstein/161/960229.html
 *
 * Some ideas (and a bit of code) are from analyze.c, from GNU
 * diffutils-2.7, which can be found at:
 *	 ftp://gnudist.gnu.org/pub/gnu/diffutils ... 2.7.tar.gz
 *
 * Finally, some ideas (subdivision by NCHUNKS > 2, and some optimizations)
 * are my own.
 *
 * @author Geoffrey T. Dairiki
 * @access private
 */
That comment points to some good URLs to start from when you go looking for related algorithms.

And of course, the code is licensed under the GNU GPL, so if you're using a compatible license, feel free to reuse it.
Syranide
Forum Contributor
Posts: 281
Joined: Fri May 20, 2005 3:16 pm
Location: Sweden

Post by Syranide »

A word of caution when comparing texts: exact or good-quality comparisons are often computationally costly, so per-character or per-word comparison can be quite a resource hog on larger files. Monitor usage and make sure users can't submit arbitrarily large input, since they could bring your system to its knees if they wanted to.

(Such a file could be a large PHP source file, I guess, as PHP itself isn't very fast and comparisons like these are very slow.)
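
To make the cost concrete: a table-based (LCS) word diff does work proportional to (words in the old text) x (words in the new text), so one cheap guard is to refuse the comparison above some budget. A minimal sketch, assuming whitespace-separated words; the constant MAX_DIFF_CELLS, its value, and the function names are hypothetical and would need tuning for your server.

```php
<?php
// Hypothetical budget: maximum number of table cells we are willing to fill.
const MAX_DIFF_CELLS = 1000000;

// Count whitespace-separated words in a string.
function count_words(string $text): int
{
    return count(preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY));
}

// Return true if an old-words x new-words table fits within the budget.
function diff_is_affordable(string $old, string $new, int $maxCells = MAX_DIFF_CELLS): bool
{
    return count_words($old) * count_words($new) <= $maxCells;
}
```

Calling diff_is_affordable($old, $new) before running the comparison lets you reject or truncate oversized submissions instead of computing the diff.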
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Hmm... I think MediaWiki handles the diff comparisons fairly well (after all, it has to compute diffs almost constantly).