Compare two txt files and show duplicate lines

domainguy
Forum Newbie
Posts: 6
Joined: Fri Oct 17, 2008 9:14 am

Compare two txt files and show duplicate lines

Post by domainguy »

What is the easiest way to compare two very large text files (10+ MB) and show the lines that match? The matches could be printed to the screen or, even better, exported to a third text file.

I've searched and searched and can't find a way to do this. I've found an easy way to remove the duplicate lines and create one unique file but not to show only the duplicate lines.

Any help would be appreciated, and sorry if this is a newbie question. I've searched and banged my head against the wall trying to find a way to do this, and I know it has to be something very simple that I'm missing.

:D
papa
Forum Regular
Posts: 958
Joined: Wed Aug 27, 2008 3:36 am
Location: Sweden/Sthlm

Re: Compare two txt files and show duplicate lines

Post by papa »

What about using file() and then just match the 2 arrays row by row?

http://us.php.net/manual/en/function.file.php
aceconcepts
DevNet Resident
Posts: 1424
Joined: Mon Feb 06, 2006 11:26 am
Location: London

Re: Compare two txt files and show duplicate lines

Post by aceconcepts »

You could use two parallel arrays, adding each duplicate line to a result list as you progress through them.
domainguy
Forum Newbie
Posts: 6
Joined: Fri Oct 17, 2008 9:14 am

Re: Compare two txt files and show duplicate lines

Post by domainguy »

I thought about that but if the two files are fairly different you would have to compare row1 of file1 to every row of file2 and so on. Right?

Currently I combine both files into one and use this to remove the duplicate lines. It's very easy and fast, but it doesn't tell me what the dupes are. I'd like to know that somehow.

Code: Select all

<?php
// Load file into Array
$list = file('file1.txt');
 
// Remove duplicates
$list = array_unique($list);
 
// Write back to file
file_put_contents('unique.txt', implode('', $list));
?>
papa
Forum Regular
Posts: 958
Joined: Wed Aug 27, 2008 3:36 am
Location: Sweden/Sthlm

Re: Compare two txt files and show duplicate lines

Post by papa »

I was thinking of maybe a for loop:

Code: Select all

 
 
$file1 = file('http://www.example.com/');
$file2 = file('http://www.freesex.com/');
 
for($i=; $i<count($file1); $i++) {
if($file1[$i] == $file2[$i]) echo "match";
}
 
 
Not the best code, but maybe you get the idea...
domainguy
Forum Newbie
Posts: 6
Joined: Fri Oct 17, 2008 9:14 am

Re: Compare two txt files and show duplicate lines

Post by domainguy »

Syntax error on line 6. Hmm...
papa
Forum Regular
Posts: 958
Joined: Wed Aug 27, 2008 3:36 am
Location: Sweden/Sthlm

Re: Compare two txt files and show duplicate lines

Post by papa »

The start value was missing from the loop. It should begin with for($i = 0; ...

Code: Select all

 
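<?php
 
 // Read both files into arrays, one element per line
 $file1 = file('default.css');
 $file2 = file('Copy of default.css');
 
 // Only compare rows that exist in both files
 $rows = min(count($file1), count($file2));
 
 for ($i = 0; $i < $rows; $i++) {
     // === avoids PHP treating numeric-looking lines such as "1e3" and "1000" as equal
     if ($file1[$i] === $file2[$i]) {
         echo ($i + 1) . " - <i>" . $file1[$i] . "</i> - <b>match</b><br />";
     }
 }
 
?>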
Seems to work.
domainguy
Forum Newbie
Posts: 6
Joined: Fri Oct 17, 2008 9:14 am

Re: Compare two txt files and show duplicate lines

Post by domainguy »

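That's almost it. The only problem is that the script compares the files position by position, so as soon as a line is missing from one file, nothing after that point lines up any more.

I need to tweak it so that it keeps finding matches even when the lines are not at the same position.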

For example:

File 1:
apples
oranges
grapes
bananas
pineapples

File 2:
apples
cranberries
bananas
pineapples

Your script would show a match for apples and that's it.

I'll keep working with it.
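
One way to keep checking regardless of position, sketched under the assumption that both files fit in memory: index the lines of the second file as array keys, then look up each line of the first file with isset(). The function name common_lines() is made up for illustration, and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper: returns the lines of $pathA that also occur anywhere in $pathB.
function common_lines(string $pathA, string $pathB): array
{
    // Strip line endings so a last line without "\n" still matches
    $a = file($pathA, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $b = file($pathB, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($a === false || $b === false) {
        throw new RuntimeException('Could not read one of the input files');
    }

    // Use the lines of file B as array keys so each lookup is a hash lookup
    // instead of a scan over every row of file B
    $inB = array_fill_keys($b, true);

    $matches = [];
    foreach ($a as $line) {
        if (isset($inB[$line])) {
            $matches[$line] = true; // keyed by line, so each match is kept once
        }
    }

    // PHP turns numeric-looking keys into integers; convert them back to strings
    return array_map('strval', array_keys($matches));
}
```

With the two example files above this should return apples, bananas and pineapples. The result could then be written out with something like file_put_contents('dupes.txt', implode(PHP_EOL, common_lines('file1.txt', 'file2.txt'))), where the file names are placeholders.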
domainguy
Forum Newbie
Posts: 6
Joined: Fri Oct 17, 2008 9:14 am

Re: Compare two txt files and show duplicate lines

Post by domainguy »

I'm lost. :)

This would be much easier if there was simply an opposite version of the array_unique() function. Say, array_duplicate(). ;)

Anyone have an idea? Let's scrap the two file format. Just one big text file, go through line by line, and export a list of lines that appear more than once within the same file. Any ideas how to do that?
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Compare two txt files and show duplicate lines

Post by VladSun »

If you use Linux:
http://unixhelp.ed.ac.uk/CGI/man-cgi?diff
http://unixhelp.ed.ac.uk/CGI/man-cgi?uniq

And I suppose that it will be much quicker on large files than a PHP implementation.
There are 10 types of people in this world, those who understand binary and those who don't
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Re: Compare two txt files and show duplicate lines

Post by onion2k »

I'd suggest opening them both in Textpad and using the Compare Files option on Windows. Far simpler than coding your own solution.
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Re: Compare two txt files and show duplicate lines

Post by onion2k »

domainguy wrote: This would be much easier if there was simply an opposite version of the array_unique() function. Say, array_duplicate(). ;)
You could always use array_unique() to find the unique ones, then use array_diff_assoc() to compare the array of unique entries to the original array... that'd give you any that aren't unique, i.e. the dupes.
domainguy
Forum Newbie
Posts: 6
Joined: Fri Oct 17, 2008 9:14 am

Re: Compare two txt files and show duplicate lines

Post by domainguy »

onion2k wrote: You could always use array_unique() to find the unique ones, then use array_diff_assoc() to compare the array of unique entries to the original array... that'd give you any that aren't unique, i.e. the dupes.
Someone else mentioned doing it this way. Can you or someone else tell me how to incorporate this into my script? Sorry, I'm really new at this so I'm sure it's something easy but I'm lost. :)

Code: Select all

<?php
// Load file into Array
$list = file('file.txt');
 
// Remove duplicates
$list = array_unique($list);
 
// Write back to file
file_put_contents('uniques.txt', implode('', $list));
?>
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Re: Compare two txt files and show duplicate lines

Post by onion2k »

Code: Select all

<?php
 
    $array = array("apples",
            "apples",
            "oranges",
            "grapes",
            "bananas",
            "pineapples",
            "apples",
            "cranberries",
            "bananas",
            "pineapples",
            "grapes",
            "apples");
 
    $unique = array_unique($array);
    $diff = array_diff_assoc($array, $unique);
 
    print_r($array);
    echo "<br>";
    print_r($unique);
    echo "<br>";
    print_r($diff);
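
Note that $diff lists a repeated line once for every extra occurrence, so "apples" shows up three times in it. If a list of distinct duplicates is wanted instead, array_count_values() is one alternative. The sketch below is not the method posted above, and the function name duplicate_lines() is made up for illustration.

```php
<?php
// Hypothetical helper: maps each line that occurs more than once to its count.
function duplicate_lines(array $lines): array
{
    // array_count_values() builds [value => number of occurrences]
    $counts = array_count_values($lines);

    // Keep only the values seen at least twice
    return array_filter($counts, function ($n) {
        return $n > 1;
    });
}

$array = array("apples", "apples", "oranges", "grapes", "bananas", "pineapples",
               "apples", "cranberries", "bananas", "pineapples", "grapes", "apples");

// Lists apples (4), grapes (2), bananas (2) and pineapples (2)
print_r(duplicate_lines($array));
```

For a real file the input could come from file('file.txt', FILE_IGNORE_NEW_LINES), and array_keys() on the result gives just the duplicated lines. One caveat: lines that look like whole numbers come back as integer keys.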
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Compare two txt files and show duplicate lines

Post by VladSun »

There is one "little" problem with this code: it keeps three arrays in memory at once ($array, $unique and $diff), so it can need 3 x file_size memory or more. For the 10+ MB files mentioned in the OP, that means 30+ MB, which will most probably exceed the memory limit on shared hosting.
A more memory-friendly (though more I/O-intensive) solution would be to read the file line by line with fgets().
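
A rough sketch of what an fgets()-based version might look like, for the single-file case. It is an assumption about the approach rather than tested advice for any particular host: it still keeps one copy of every distinct line as an array key, so it saves memory compared with holding three arrays but is not constant-memory. The function name write_duplicates() is made up for illustration.

```php
<?php
// Hypothetical helper: streams $inPath line by line and writes every line that
// occurs more than once to $outPath (one copy each). Returns the number written.
function write_duplicates(string $inPath, string $outPath): int
{
    $in = fopen($inPath, 'r');
    $out = fopen($outPath, 'w');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open input or output file');
    }

    $seen = [];    // line => true once it has been reported, false before that
    $written = 0;

    // fgets() returns one line per call and false at end of file
    while (($line = fgets($in)) !== false) {
        $line = rtrim($line, "\r\n");
        if (!isset($seen[$line])) {
            $seen[$line] = false;           // first sighting
        } elseif ($seen[$line] === false) {
            fwrite($out, $line . PHP_EOL);  // second sighting: report it once
            $seen[$line] = true;
            $written++;
        }
    }

    fclose($in);
    fclose($out);
    return $written;
}
```

A call such as write_duplicates('file.txt', 'dupes.txt') would then produce the duplicate list directly on disk; both file names are placeholders.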