Comparing 2 large CSV files with PHP

Not for 'how-to' coding questions but PHP theory instead, this forum is here for those of us who wish to learn about design aspects of programming with PHP.

Moderator: General Moderators

yarons
Forum Newbie
Posts: 24
Joined: Mon Sep 11, 2006 7:25 am

Comparing 2 large CSV files with PHP

Post by yarons »

Hi all,

My objective is to write a piece of code that will compare 2 large CSV files (about 2GB).
I need to check that all entries in 1 files exists in the other file and for those entries that exists on both (the key consists of 1-3 values in each row of the CSV file) , validate that they contain the same values i.e. the rows are identical.

Now, because we're talking about large files that are not ordered in any way, I would imagine this task would take quite a long time.
I am looking for suggestions on how to tackle this problem. The most naive approach is to take each row from the first file and loop through all the rows in the 2nd file trying to find a match. When I do - macth the values.

I was although thinking of crating a C or Java program that will do that and then just call it from my PHP code.
Any creative suggestions?

Many thanks for your help
User avatar
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Post by onion2k »

I think I'd import the data into a couple of tables in a database and then compare them using SQL .. a simple right join should be enough to check all the records from one table exist in the other.
User avatar
dbevfat
Forum Contributor
Posts: 126
Joined: Tue Jun 28, 2005 2:47 pm
Location: Ljubljana, Slovenia

Post by dbevfat »

Your starting point could be "diff". A while ago I found some articles that explained how to do it, but I can't find any right now. I'm sure you can find ready-made code that compares these two files and maybe only operate on the output (the actual differences)?
User avatar
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Post by onion2k »

dbevfat wrote:Your starting point could be "diff". A while ago I found some articles that explained how to do it, but I can't find any right now. I'm sure you can find ready-made code that compares these two files and maybe only operate on the output (the actual differences)?
That would only really work if the two CSV files are in the same order. Given the OP states that the files are "not ordered in any way" that's possibly not the case.
yarons
Forum Newbie
Posts: 24
Joined: Mon Sep 11, 2006 7:25 am

Is SQL really the answer?

Post by yarons »

As I states, the files are not ordered.
I began coding this and my approach was as follows.

Read both files into an array using file() and then run the array_diff() on them.
This will give me the entries that do not exist (i.e. simply not there or there but do not match).
From this point I can take the diff array and try using the key to find the entries in the 2nd file.
If I find the key then there is a mismatch. In which case I need to identify exactly what the mismatch is (probably explode the rows into values and compare with diff again).
If I don't find the key then the row simply do not exist which is something I'm interested in too.

*note- by saying "key" I don't mean an array key but 3 values in a CSV row that uniquely identify an entry.

Now, it seems I can use built-in PHP functions from most of my algorithm except of searching the key in the 2nd file for all those diffs.

Does this seem reasonable? How would you suggest using SQL for this?
User avatar
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Re: Is SQL really the answer?

Post by Mordred »

yarons wrote:Read both files into an array using file() and then run the array_diff() on them.
This will give me the entries that do not exist (i.e. simply not there or there but do not match).
No, with 2G files this will give you a "hung" php thread and a headache (and maybe a result tomorrow).

I'm not sure how SQL will go about this (I think worse than if you do it right "manually"), but if you choose this approach DO NOT create indexes on the table before adding the data, do it afterwards. (I'm not 100% sure about that, just heard it somewhere. Please a DB expert comment on this one!)

Otherwise I would go with reading chunks off each file and compare multiple lines at once. If you have a suitable machine which can hold lots of RAM, you may try filling it to the brim with RAM chips and trying to hold HUGE chunks of the second file in memory (or if it's enough - the whole file).
yarons
Forum Newbie
Posts: 24
Joined: Mon Sep 11, 2006 7:25 am

thanks for the advice

Post by yarons »

Wouldn't be much easier/quicker to write a C/Java program to do that and just call it via PHP using exec?
User avatar
Mordred
DevNet Resident
Posts: 1579
Joined: Sun Sep 03, 2006 5:19 am
Location: Sofia, Bulgaria

Re: thanks for the advice

Post by Mordred »

yarons wrote:Wouldn't be much easier/quicker to write a C/Java program to do that and just call it via PHP using exec?
It would be quicker (execution time) to write a C program and execute it from the command line. I thought you had particular interest in a PHP solution ;)
yarons
Forum Newbie
Posts: 24
Joined: Mon Sep 11, 2006 7:25 am

PHP indeed

Post by yarons »

It would be quicker (execution time) to write a C program and execute it from the command line. I thought you had particular interest in a PHP solution
Yes. my framework is PHP but I figured that for lengthy tasks I can write C code and invoke it from PHP.
How quicker will C do the job (I think I'm asking for an educated guess here :))?

Assuming I don't have built in functions like file() or array_diff() , I don't want to spend a few precious days of coding just to find out I gained 5 seconds...

Thoughts?
User avatar
dbevfat
Forum Contributor
Posts: 126
Joined: Tue Jun 28, 2005 2:47 pm
Location: Ljubljana, Slovenia

Post by dbevfat »

If you go with DB (which is a reasonable idea, given that your records are not sorted), then the difference between PHP and/or C/java should be marginal compared to the actual comparation time.
User avatar
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Post by Christopher »

I agree on the DB route. Something like MySQL's "LOAD DATA INFILE" statement is very high performance. Once in a DB you could solve the problem easily.
(#10850)
yarons
Forum Newbie
Posts: 24
Joined: Mon Sep 11, 2006 7:25 am

ok then...

Post by yarons »

I am not sure how to solve this using MySQL

are you suggesting writing each row as a DB entry? is that as 1 piece or broken down to values as per the CSV?

What is the quickest way to do that? and once in the DB, how will I tackle the comparsion problem?

sorry about all the questions but I'm quite confused on which way to go with this.
User avatar
dbevfat
Forum Contributor
Posts: 126
Joined: Tue Jun 28, 2005 2:47 pm
Location: Ljubljana, Slovenia

Re: ok then...

Post by dbevfat »

yarons wrote:are you suggesting writing each row as a DB entry? is that as 1 piece or broken down to values as per the CSV?
Depends on what you need. If you need to compare field-by-field to discover some "partial" differences between two rows, then breaking the rows down to values is ok, otherwise row-to-row is fine, but much less flexible for further processing, because all parsing (if any) has yet to be done.
What is the quickest way to do that? and once in the DB, how will I tackle the comparsion problem?
To refer my first post about using a diff-like comparison: you said the rows aren't sorted. If you insert parsed rows (you put each value - or just the "important" ones - in its column), you can sort them. If you can sort them, you can output them to files again and then use a diff utility. Suggesting diff may be a long shot on my part, but it's an option, although it strongly depends on what you actually want.

About speed, I'm guessing LOAD DATA INFILE is very fast, as suggested by arborint. Also, not having indices helps. Mass inserts are usually performed like this:
1) drop indices
2) insert a lot of rows
3) recreate indices

Regards
yarons
Forum Newbie
Posts: 24
Joined: Mon Sep 11, 2006 7:25 am

Post by yarons »

Ok then. I will try LOAD DATA INFILE and will let you know how it goes.
I do need my compariosn to indicate differences in specific field/values so I will create a db with thos columns.

Once the data is there, will pure SQL do a better job than PHP? i.e. write simple "select" qureies where ever there is a differnce.

Thanks for all your help.
User avatar
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Post by Christopher »

yarons wrote:Ok then. I will try LOAD DATA INFILE and will let you know how it goes.
I do need my compariosn to indicate differences in specific field/values so I will create a db with thos columns.

Once the data is there, will pure SQL do a better job than PHP? i.e. write simple "select" qureies where ever there is a differnce..
Yes and a DB engine will be faster than any programmed algorithm. You probably want to join the tables and then check for NULL values.
(#10850)
Post Reply