Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 9:16 am
by domainguy
What is the easiest way to compare two very large text files (10+ MB) and show the lines that match? They could be dumped to the display or even better exported to a 3rd text file.

I've searched and searched and can't find a way to do this. I've found an easy way to remove the duplicate lines and create one unique file but not to show only the duplicate lines.

Any help would be appreciated, and sorry if this is a newbie question. I've searched and banged my head against the wall trying to find a way to do this, and I know it has to be something very simple that I'm missing.

:D

Re: Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 9:22 am
by papa
What about using file() and then just match the 2 arrays row by row?

http://us.php.net/manual/en/function.file.php

Re: Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 9:27 am
by aceconcepts
You could use two parallel arrays - adding the duplicate lines accordingly as you progress.

Re: Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 9:28 am
by domainguy
I thought about that but if the two files are fairly different you would have to compare row1 of file1 to every row of file2 and so on. Right?

Currently I combine both files and use this to remove the duplicate lines. It's very easy and fast, but it doesn't tell me what the dupes are. I'd like to know that somehow.

Code:

<?php
// Load file into Array
$list = file('file1.txt');
 
// Remove duplicates
$list = array_unique($list);
 
// Write back to file
file_put_contents('unique.txt', implode('', $list));
?>

Re: Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 9:36 am
by papa
I was thinking of maybe a for loop:

Code:

 
 
$file1 = file('http://www.example.com/');
$file2 = file('http://www.freesex.com/');
 
for($i=; $i<count($file1); $i++) {
if($file1[$i] == $file2[$i]) echo "match";
}
 
 
Not the best code, but maybe you get the idea...

Re: Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 10:00 am
by domainguy
Syntax error on line 6. Hmm...

Re: Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 10:01 am
by papa
That should be for($i=0; $i<count($file1); $i++)

Code:

 
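<?php

// Load both files into arrays, one element per line
$file1 = file('default.css');
$file2 = file('Copy of default.css');

// Compare line N of file 1 with line N of file 2 only (positions must line up)
$rows = min(count($file1), count($file2));
for ($i = 0; $i < $rows; $i++) {
    if ($file1[$i] == $file2[$i]) {
        // $i is zero-based, so add 1 for a human-readable row number
        echo ($i + 1) . " - <i>" . $file1[$i] . "</i> - <b>match</b><br />";
    }
}

?>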
Seems to work.

Re: Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 10:12 am
by domainguy
That's almost it. The only problem is as soon as one line doesn't match it negates the rest of the file.

I need to tweak it so that when one line doesn't match it keeps checking.

For example:

File 1:
apples
oranges
grapes
bananas
pineapples

File 2:
apples
cranberries
bananas
pineapples

Your script would show a match for apples and that's it.

I'll keep working with it.
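
One way to keep checking after a mismatch is to stop comparing by position and instead ask, for each line of file 1, whether it occurs anywhere in file 2. The sketch below is only an illustration of that idea, not code from anyone in this thread: matching_lines() is a hypothetical helper, and the sample arrays are the File 1 / File 2 lists from the post above. It flips file 2 into array keys so each lookup is a single isset() check rather than a scan of every row.

```php
<?php
// Hypothetical helper: return the lines of $a that also occur anywhere in $b.
function matching_lines(array $a, array $b)
{
    // Use file 2's lines as array keys so a lookup does not scan the whole array.
    $lookup = array_flip($b);

    $matches = array();
    foreach ($a as $line) {
        if (isset($lookup[$line])) {
            // Keyed by the line itself, so a repeated match is listed only once.
            $matches[$line] = true;
        }
    }

    // Cast back to strings: numeric-looking keys are stored as integers.
    return array_map('strval', array_keys($matches));
}

// Sample data taken from the example in the post above.
$file1 = array('apples', 'oranges', 'grapes', 'bananas', 'pineapples');
$file2 = array('apples', 'cranberries', 'bananas', 'pineapples');

$matches = matching_lines($file1, $file2);
echo implode("\n", $matches) . "\n"; // apples, bananas, pineapples (one per line)
```

With real files, the two arrays could come from file('file1.txt', FILE_IGNORE_NEW_LINES) and file('file2.txt', FILE_IGNORE_NEW_LINES) (file names assumed), and the result could be written out with file_put_contents(). Stripping the newlines matters because a last line without a trailing newline would otherwise never match.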

Re: Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 10:45 am
by domainguy
I'm lost. :)

This would be much easier if there was simply an opposite version of the array_unique() function. Say, array_duplicate(). ;)

Anyone have an idea? Let's scrap the two file format. Just one big text file, go through line by line, and export a list of lines that appear more than once within the same file. Any ideas how to do that?
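
For the single-file version of the question, PHP's built-in array_count_values() gets close to the wished-for array_duplicate(). The following is a minimal sketch under that assumption; duplicate_lines() is a made-up name and the sample data is invented for illustration.

```php
<?php
// Hypothetical helper: list every line that appears more than once.
function duplicate_lines(array $lines)
{
    // array_count_values() maps each distinct line to its number of occurrences.
    $counts = array_count_values($lines);

    $dupes = array();
    foreach ($counts as $line => $count) {
        if ($count > 1) {
            // Numeric-looking keys come back as integers, so cast to string.
            $dupes[] = (string) $line;
        }
    }
    return $dupes;
}

// Invented sample data; a real run might use
// $lines = file('file.txt', FILE_IGNORE_NEW_LINES);
$lines = array('apples', 'oranges', 'apples', 'grapes', 'oranges', 'bananas');
$dupes = duplicate_lines($lines);
print_r($dupes);
```

Each duplicated line is reported once, in the order it first appears. The result could be exported with file_put_contents('dupes.txt', implode("\n", $dupes)), where dupes.txt is an assumed output name.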

Re: Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 11:04 am
by VladSun
If you use Linux:
http://unixhelp.ed.ac.uk/CGI/man-cgi?diff
http://unixhelp.ed.ac.uk/CGI/man-cgi?uniq

And I suppose that it will be much quicker on large files than a PHP implementation.

Re: Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 2:46 pm
by onion2k
I'd suggest opening them both in Textpad and using the Compare Files option on Windows. Far simpler than coding your own solution.

Re: Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 2:49 pm
by onion2k
domainguy wrote: This would be much easier if there was simply an opposite version of the array_unique() function. Say, array_duplicate(). ;)
You could always use array_unique to find the unique ones, then use array_diff() to compare the array of unique entries to the original array... that'd give you any that aren't unique, i.e. the dupes.

Re: Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 2:57 pm
by domainguy
onion2k wrote: You could always use array_unique to find the unique ones, then use array_diff() to compare the array of unique entries to the original array... that'd give you any that aren't unique, i.e. the dupes.
Someone else mentioned doing it this way. Can you or someone else tell me how to incorporate this into my script? Sorry, I'm really new at this, so I'm sure it's something easy, but I'm lost. :)

Code:

<?php
// Load file into Array
$list = file('file.txt');
 
// Remove duplicates
$list = array_unique($list);
 
// Write back to file
file_put_contents('uniques.txt', implode('', $list));
?>

Re: Compare two txt files and show duplicate lines

Posted: Fri Oct 17, 2008 3:29 pm
by onion2k

Code:

<?php
 
    $array = array("apples",
            "apples",
            "oranges",
            "grapes",
            "bananas",
            "pineapples",
            "apples",
            "cranberries",
            "bananas",
            "pineapples",
            "grapes",
            "apples");
 
    $unique = array_unique($array);
    $diff = array_diff_assoc($array, $unique);
 
    print_r($array);
    echo "<br>";
    print_r($unique);
    echo "<br>";
    print_r($diff);
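
A note on why the code above uses array_diff_assoc() rather than plain array_diff(): array_diff() compares values only, and every value in the original array is also present in the unique array, so it returns nothing. array_diff_assoc() compares key/value pairs, and since array_unique() keeps the keys of the first occurrences, the pairs left over are the repeat occurrences. The sketch below shows this on a smaller invented array and collapses the result so each duplicated line is listed once; the variable names are illustrative only.

```php
<?php
$lines = array('apples', 'apples', 'oranges', 'bananas', 'apples', 'bananas');

// array_unique() keeps the first occurrence of each value and preserves keys (0, 2, 3).
$unique = array_unique($lines);

// Value-only comparison: every value also exists in $unique, so nothing is left.
$byValue = array_diff($lines, $unique);

// Key/value comparison: keys 1, 4 and 5 are missing from $unique, so they remain.
$repeats = array_diff_assoc($lines, $unique);

// Collapse the repeats so each duplicated line is listed once, with keys renumbered.
$dupes = array_values(array_unique($repeats));

print_r($dupes);
```

To apply this to a file, $lines could be loaded with file('file.txt', FILE_IGNORE_NEW_LINES) and $dupes written out with file_put_contents(), as in the earlier script.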

Re: Compare two txt files and show duplicate lines

Posted: Sat Oct 18, 2008 5:50 pm
by VladSun
There will be a "little" problem with this code: it needs at least 3 x file_size memory. With the 10+ MB files mentioned in the OP, that means 30+ MB of memory, so on shared hosting it most probably will not work.
A memory-friendly (though more I/O-intensive) solution would be to use fgets().
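
A rough sketch of what an fgets()-based approach might look like follows. It is an assumption about what is meant here, not VladSun's own code: find_duplicates_streaming() is a hypothetical name, and the demo writes an invented temporary file so the example runs on its own. It reads one line at a time and keeps only one copy of each distinct line as an array key, instead of holding the original, unique, and diff arrays at once. Memory still grows with the number of distinct lines.

```php
<?php
// Hypothetical helper: stream a file and return the lines seen more than once.
function find_duplicates_streaming($path)
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return array();
    }

    $seen  = array(); // line => number of sightings so far (capped at 2)
    $dupes = array();
    while (($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n"); // compare lines without their line endings
        if (!isset($seen[$line])) {
            $seen[$line] = 1;
        } elseif ($seen[$line] === 1) {
            $seen[$line] = 2;
            $dupes[] = $line; // record a line on its second sighting only
        }
    }
    fclose($handle);

    return $dupes;
}

// Demo with an invented temporary file.
$tmp = tempnam(sys_get_temp_dir(), 'dup');
file_put_contents($tmp, "apples\noranges\napples\nbananas\noranges\napples\n");
$dupes = find_duplicates_streaming($tmp);
unlink($tmp);

print_r($dupes);
```

For files too large even for this, the command-line tools mentioned earlier in the thread are probably the better fit.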