Large Size Database
Posted: Wed Nov 12, 2008 9:56 am
by urieilam
Hi Everyone,
I was wondering if someone out there has some experience with working with large databases. I'm working on a backend which sends newsletters and maintains a database of the contacts. It should hold around 40,000 contacts, with around 6 fields per contact. I'm using PHP 5.
The backend is already programmed, and functions ok. The problem is that it is extremely slow.
The database pulls info out of .csv files. On top of just entering the contacts in to the database, it compares existing information, in order to avoid double entries for the same email addresses. I've scripted it to chop the data into batches of 1000, and to reload the script at the end of each batch.
It currently takes more than an hour to upload a new csv with about 10000 contacts, compare the info, and insert the information.
I've tried running it both with a SQL database and a XML one (I know it's not the best idea to use XML for this, but hoped it would be quicker), and both take around the same amount of time.
Any ideas on what the right design for something like this would be? Are there any quicker ways to go about it? I'm not much of a database expert, especially when it comes to something as huge as this.
Any help would be appreciated!
Thanks, uri
Re: Large Size Database
Posted: Wed Nov 12, 2008 10:15 am
by Eran
40,000 rows is a relatively small database. MySQL can easily handle millions of rows per table.
Reading your description, the first thing that comes to mind is that you might be running too many queries (a query per row?) to complete your process. If you could post some of the code and the underperforming queries, we might be able to help you optimize.
Re: Large Size Database
Posted: Wed Nov 12, 2008 6:21 pm
by urieilam
Hi Pytrin,
I guess you're right. I've been running two queries on every row of the .csv: one to check if the contact already exists, and another to either insert or update the contact, depending on the answer. Basically, the first one is:
"SELECT * FROM addresses WHERE email='$email'"
and the second one either:
"UPDATE addresses SET lan='$lan' , email='$email' WHERE id = '$num'"
or
"INSERT INTO addresses VALUES ('','$i','','','$i','en')"
I guess it would be quicker to first read all the email addresses into one array, then compare that array with each row of the csv. What about the insertion? Is there a quicker way to do that, not row by row, but all at once (I assume any insertion would require some kind of loop)?
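For what it's worth, MySQL does accept several rows in a single INSERT statement, so a loop only has to build the value list rather than fire a query per row. A sketch, with the six-column layout copied from the queries above and placeholder values:

```sql
-- One statement, many rows: far fewer round trips than one INSERT per row
INSERT INTO addresses VALUES
  ('', 'status1', '', '', 'a@example.com', 'en'),
  ('', 'status2', '', '', 'b@example.com', 'en'),
  ('', 'status3', '', '', 'c@example.com', 'en');
```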
Re: Large Size Database
Posted: Wed Nov 12, 2008 6:33 pm
by Eran
If you can spare the memory, I'd recommend reading the entire CSV into an array and composing a query that searches for multiple duplicates in one go (if the entire file is too much, break it into chunks). An alternative method would be to load the CSV directly into a (temporary?) table (LOAD DATA INFILE) and run a query that returns only the rows in that table which aren't already in the contacts table. Insertions can likewise be batched by putting many value tuples into a single INSERT statement.
However, this might not even be necessary. 10,000 rows is not a lot, and MySQL should be able to handle this load in a few minutes. If you can, run EXPLAIN on each of the queries you are running and post the results here (by the way, is there an index on the email column?)
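For reference, adding an index on the column used in the duplicate check, and inspecting the query plan, might look like this (table and column names taken from the queries above; the address is a placeholder):

```sql
-- Without an index on email, every duplicate check scans the whole table
ALTER TABLE addresses ADD INDEX idx_email (email);

-- EXPLAIN shows whether the SELECT now uses idx_email instead of a full scan
EXPLAIN SELECT * FROM addresses WHERE email = 'someone@example.com';
```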
Re: Large Size Database
Posted: Wed Nov 12, 2008 7:11 pm
by VladSun
Or you can simply let MySQL do the work by using a proper DB design.

Create a unique constraint on all columns whose value combination needs to be unique.
E.g.
[sql]CREATE TABLE `contact` ( `phone` varchar(11) NOT NULL, `name` varchar(11) NOT NULL, UNIQUE KEY `name_phone_pair` (`phone`,`name`)) ENGINE=MyISAM[/sql]
Then simply import your CSV file without worrying about duplicates in a single simple query:
[sql]LOAD DATA LOCAL INFILE '/data.csv' INTO TABLE `contact`[/sql]
All duplicate records (according to the unique constraints you have defined) will be skipped.

Re: Large Size Database
Posted: Wed Nov 12, 2008 7:25 pm
by Eran
He wants to update existing records as well
Re: Large Size Database
Posted: Wed Nov 12, 2008 7:35 pm
by VladSun
You can use the REPLACE option of LOAD DATA:
[sql]LOAD DATA LOCAL INFILE '/data.csv' REPLACE INTO TABLE `contact`[/sql]
When a row duplicates an existing unique key, the old row is replaced with the new data.
Re: Large Size Database
Posted: Wed Nov 12, 2008 7:50 pm
by Eran
That's pretty good
Does the primary key keep its value this way?
Re: Large Size Database
Posted: Wed Nov 12, 2008 8:04 pm
by VladSun
pytrin wrote:That's pretty good
Does the primary key keep its value this way?
It's not typical in a CSV import for the primary key to be taken from the CSV data itself.
My tests showed that an auto_increment primary key `id` changes its value on every replace made on the row.
That's because REPLACE really is a DELETE followed by an INSERT.
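A quick way to see this, on a hypothetical minimal table:

```sql
CREATE TABLE t (
  id INT AUTO_INCREMENT PRIMARY KEY,
  email VARCHAR(64) NOT NULL,
  lan CHAR(2),
  UNIQUE KEY (email)
);

INSERT INTO t (email, lan) VALUES ('a@example.com', 'en');  -- row gets id = 1
-- Same email hits the unique key: the old row is deleted, a new one inserted
REPLACE INTO t (email, lan) VALUES ('a@example.com', 'de'); -- row now has id = 2
```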
Re: Large Size Database
Posted: Wed Nov 12, 2008 8:05 pm
by Eran
Then it's not good for integrity... but maybe it's not needed
Re: Large Size Database
Posted: Wed Nov 12, 2008 8:13 pm
by VladSun
pytrin wrote:Then it's not good for integrity... but maybe it's not needed
Integrity and CSV ... no such thing

Re: Large Size Database
Posted: Thu Nov 13, 2008 8:49 am
by urieilam
Thanks for the suggestions. VladSun, I took your idea of making the email column unique. The problem is that there is another column I need to keep untouched, i.e. not have updated, called "status" - it marks whether someone has unsubscribed. So, instead of using LOAD DATA, I'm using an INSERT query:
$query = "INSERT INTO addresses VALUES ('','$status','','','$email','$lan')
ON DUPLICATE KEY UPDATE lan='$lan'...";
It's already cut a lot of the time off, since everything now runs in just one query. Do you know of a way I could use LOAD DATA and update only certain fields on duplicates?
Re: Large Size Database
Posted: Thu Nov 13, 2008 10:33 am
by Eran
If integrity is not important to you (i.e., you don't need the ID of the row to remain constant), you can use Vlad's suggestion and use LOAD ... REPLACE
Code: Select all
LOAD DATA LOCAL INFILE '/path/to/data.csv' REPLACE
INTO TABLE addresses
Alternatively, you could load the CSV into a temporary table using LOAD DATA INFILE, and then use one query to do all the inserting / updating:
Code: Select all
CREATE TEMPORARY TABLE csv_file (
  -- table definition here
);
LOAD DATA LOCAL INFILE '/path/to/data.csv' INTO TABLE csv_file;
INSERT INTO addresses (col1, col2, ...)
SELECT col1, col2, ... FROM csv_file
ON DUPLICATE KEY UPDATE col1 = VALUES(col1), ...;
Re: Large Size Database
Posted: Thu Nov 13, 2008 12:53 pm
by josh
You can use a REPLACE instead of a SELECT & INSERT / UPDATE (which is not the same thing as the LOAD DATA syntax recommended above).
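A sketch of what that looks like, with the column layout taken from the INSERT earlier in the thread and placeholder values:

```sql
-- If a row with this email already exists (unique key), it is deleted
-- and this row is inserted in its place; otherwise it's a plain insert
REPLACE INTO addresses VALUES ('', 'unsubscribed', '', '', 'a@example.com', 'en');
```

Note that, unlike INSERT ... ON DUPLICATE KEY UPDATE, this replaces the whole row, so columns you wanted to keep (like "status") are overwritten too.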
You should also run your queries through EXPLAIN and post each query and its EXPLAIN results here, so we can see if you're making any mistakes performance-wise.
You may also be overlooking some key normalization / de-normalization in your design that could solve the bottleneck.
Have you benchmarked the CSV parsing separately from the database?
Re: Large Size Database
Posted: Thu Nov 13, 2008 4:13 pm
by VladSun
The main question (or at least the most important one to clarify) is: where does this CSV file come from, and how is it generated?