Comparing 2 large CSV files with PHP
Moderator: General Moderators
Comparing 2 large CSV files with PHP
Hi all,
My objective is to write a piece of code that will compare 2 large CSV files (about 2GB).
I need to check that all entries in 1 files exists in the other file and for those entries that exists on both (the key consists of 1-3 values in each row of the CSV file) , validate that they contain the same values i.e. the rows are identical.
Now, because we're talking about large files that are not ordered in any way, I would imagine this task would take quite a long time.
I am looking for suggestions on how to tackle this problem. The most naive approach is to take each row from the first file and loop through all the rows in the 2nd file trying to find a match. When I do - macth the values.
I was although thinking of crating a C or Java program that will do that and then just call it from my PHP code.
Any creative suggestions?
Many thanks for your help
My objective is to write a piece of code that will compare 2 large CSV files (about 2GB).
I need to check that all entries in 1 files exists in the other file and for those entries that exists on both (the key consists of 1-3 values in each row of the CSV file) , validate that they contain the same values i.e. the rows are identical.
Now, because we're talking about large files that are not ordered in any way, I would imagine this task would take quite a long time.
I am looking for suggestions on how to tackle this problem. The most naive approach is to take each row from the first file and loop through all the rows in the 2nd file trying to find a match. When I do - macth the values.
I was although thinking of crating a C or Java program that will do that and then just call it from my PHP code.
Any creative suggestions?
Many thanks for your help
That would only really work if the two CSV files are in the same order. Given the OP states that the files are "not ordered in any way" that's possibly not the case.dbevfat wrote:Your starting point could be "diff". A while ago I found some articles that explained how to do it, but I can't find any right now. I'm sure you can find ready-made code that compares these two files and maybe only operate on the output (the actual differences)?
Is SQL really the answer?
As I states, the files are not ordered.
I began coding this and my approach was as follows.
Read both files into an array using file() and then run the array_diff() on them.
This will give me the entries that do not exist (i.e. simply not there or there but do not match).
From this point I can take the diff array and try using the key to find the entries in the 2nd file.
If I find the key then there is a mismatch. In which case I need to identify exactly what the mismatch is (probably explode the rows into values and compare with diff again).
If I don't find the key then the row simply do not exist which is something I'm interested in too.
*note- by saying "key" I don't mean an array key but 3 values in a CSV row that uniquely identify an entry.
Now, it seems I can use built-in PHP functions from most of my algorithm except of searching the key in the 2nd file for all those diffs.
Does this seem reasonable? How would you suggest using SQL for this?
I began coding this and my approach was as follows.
Read both files into an array using file() and then run the array_diff() on them.
This will give me the entries that do not exist (i.e. simply not there or there but do not match).
From this point I can take the diff array and try using the key to find the entries in the 2nd file.
If I find the key then there is a mismatch. In which case I need to identify exactly what the mismatch is (probably explode the rows into values and compare with diff again).
If I don't find the key then the row simply do not exist which is something I'm interested in too.
*note- by saying "key" I don't mean an array key but 3 values in a CSV row that uniquely identify an entry.
Now, it seems I can use built-in PHP functions from most of my algorithm except of searching the key in the 2nd file for all those diffs.
Does this seem reasonable? How would you suggest using SQL for this?
Re: Is SQL really the answer?
No, with 2G files this will give you a "hung" php thread and a headache (and maybe a result tomorrow).yarons wrote:Read both files into an array using file() and then run the array_diff() on them.
This will give me the entries that do not exist (i.e. simply not there or there but do not match).
I'm not sure how SQL will go about this (I think worse than if you do it right "manually"), but if you choose this approach DO NOT create indexes on the table before adding the data, do it afterwards. (I'm not 100% sure about that, just heard it somewhere. Please a DB expert comment on this one!)
Otherwise I would go with reading chunks off each file and compare multiple lines at once. If you have a suitable machine which can hold lots of RAM, you may try filling it to the brim with RAM chips and trying to hold HUGE chunks of the second file in memory (or if it's enough - the whole file).
thanks for the advice
Wouldn't be much easier/quicker to write a C/Java program to do that and just call it via PHP using exec?
Re: thanks for the advice
It would be quicker (execution time) to write a C program and execute it from the command line. I thought you had particular interest in a PHP solutionyarons wrote:Wouldn't be much easier/quicker to write a C/Java program to do that and just call it via PHP using exec?
PHP indeed
Yes. my framework is PHP but I figured that for lengthy tasks I can write C code and invoke it from PHP.It would be quicker (execution time) to write a C program and execute it from the command line. I thought you had particular interest in a PHP solution
How quicker will C do the job (I think I'm asking for an educated guess here
Assuming I don't have built in functions like file() or array_diff() , I don't want to spend a few precious days of coding just to find out I gained 5 seconds...
Thoughts?
- Christopher
- Site Administrator
- Posts: 13596
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
ok then...
I am not sure how to solve this using MySQL
are you suggesting writing each row as a DB entry? is that as 1 piece or broken down to values as per the CSV?
What is the quickest way to do that? and once in the DB, how will I tackle the comparsion problem?
sorry about all the questions but I'm quite confused on which way to go with this.
are you suggesting writing each row as a DB entry? is that as 1 piece or broken down to values as per the CSV?
What is the quickest way to do that? and once in the DB, how will I tackle the comparsion problem?
sorry about all the questions but I'm quite confused on which way to go with this.
Re: ok then...
Depends on what you need. If you need to compare field-by-field to discover some "partial" differences between two rows, then breaking the rows down to values is ok, otherwise row-to-row is fine, but much less flexible for further processing, because all parsing (if any) has yet to be done.yarons wrote:are you suggesting writing each row as a DB entry? is that as 1 piece or broken down to values as per the CSV?
To refer my first post about using a diff-like comparison: you said the rows aren't sorted. If you insert parsed rows (you put each value - or just the "important" ones - in its column), you can sort them. If you can sort them, you can output them to files again and then use a diff utility. Suggesting diff may be a long shot on my part, but it's an option, although it strongly depends on what you actually want.What is the quickest way to do that? and once in the DB, how will I tackle the comparsion problem?
About speed, I'm guessing LOAD DATA INFILE is very fast, as suggested by arborint. Also, not having indices helps. Mass inserts are usually performed like this:
1) drop indices
2) insert a lot of rows
3) recreate indices
Regards
Ok then. I will try LOAD DATA INFILE and will let you know how it goes.
I do need my compariosn to indicate differences in specific field/values so I will create a db with thos columns.
Once the data is there, will pure SQL do a better job than PHP? i.e. write simple "select" qureies where ever there is a differnce.
Thanks for all your help.
I do need my compariosn to indicate differences in specific field/values so I will create a db with thos columns.
Once the data is there, will pure SQL do a better job than PHP? i.e. write simple "select" qureies where ever there is a differnce.
Thanks for all your help.
- Christopher
- Site Administrator
- Posts: 13596
- Joined: Wed Aug 25, 2004 7:54 pm
- Location: New York, NY, US
Yes and a DB engine will be faster than any programmed algorithm. You probably want to join the tables and then check for NULL values.yarons wrote:Ok then. I will try LOAD DATA INFILE and will let you know how it goes.
I do need my compariosn to indicate differences in specific field/values so I will create a db with thos columns.
Once the data is there, will pure SQL do a better job than PHP? i.e. write simple "select" qureies where ever there is a differnce..
(#10850)