Large Size Database

Questions about MySQL, PostgreSQL, and most other databases, as well as using them with PHP, can be asked here.

Moderator: General Moderators

urieilam
Forum Newbie
Posts: 10
Joined: Sat Sep 09, 2006 1:24 pm

Large Size Database

Post by urieilam »

Hi Everyone,

I was wondering if someone out there has some experience working with large databases. I'm working on a backend which sends newsletters and maintains a database of the contacts. It should hold around 40,000 contacts, with around 6 fields per contact. I'm using PHP 5.

The backend is already programmed, and functions ok. The problem is that it is extremely slow.

The database pulls info out of .csv files. On top of just entering the contacts into the database, it compares existing information in order to avoid double entries for the same email address. I've scripted it to chop the data into batches of 1,000 and to reload the script at the end of each batch.

It currently takes more than an hour to upload a new .csv with about 10,000 contacts, compare the info, and insert it.
I've tried running it both with an SQL database and an XML one (I know it's not the best idea to use XML for this, but I hoped it would be quicker), and both take around the same amount of time.

Any ideas on what the right design for something like this would be? Are there any quicker ways to go about it? I'm not much of a database expert, especially when it comes to something as huge as this.

Any help would be appreciated!
Thanks, uri
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Large Size Database

Post by Eran »

40,000 rows is a relatively small database. MySQL can easily handle millions of rows and more per table.
Reading your description, the first thing that comes to mind is that you might be running too many queries (a query per row?) to complete your process. If you could post some of the code and queries that are underperforming, we might be able to help you optimize.
urieilam
Forum Newbie
Posts: 10
Joined: Sat Sep 09, 2006 1:24 pm

Re: Large Size Database

Post by urieilam »

Hi Pytrin,
I guess you're right. I've been running two queries on every row of the .csv: one to check if the contact already exists, and another to either insert or update the contact, depending on the answer. Basically, the first one is:
"SELECT * FROM addresses WHERE email='$email'"
and the second one either:
"UPDATE addresses SET lan='$lan' , email='$email' WHERE id = '$num'"
or
"INSERT INTO addresses VALUES ('','$i','','','$i','en')"

I guess it would be quicker to first read all the email addresses into one array, then compare that array with each row of the csv. What about the insertion? Is there a quicker way to do that, not row by row, but all at once? (I assume any insertion would require some kind of loop.)
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Large Size Database

Post by Eran »

If you can spare the memory, I'd recommend reading the entire CSV into an array. Compose a query that searches for multiple duplicates in one go (if the entire file is too much, break it into chunks). An alternative would be to load the CSV directly into a (temporary?) table (LOAD DATA INFILE) and run a query that returns only the rows unique between that table and the contacts table. Insertion still has to be done one row at a time.

However, this might not even be necessary. 10,000 rows is not a lot, and MySQL should be able to handle this load in a few minutes. If you could, run EXPLAIN on each of the queries you are running and post the results here (by the way, is there an index on the email column?)
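For example, the duplicate check could be composed like this (just a sketch; the table and column names are taken from the queries you posted, and the literal addresses stand in for values collected from the CSV):

Code: Select all

-- one query per batch instead of one query per row
SELECT email FROM addresses
WHERE email IN ('a@example.com', 'b@example.com', 'c@example.com');

Read the result into a PHP array keyed by email, and one lookup per batch replaces a query per row.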
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Large Size Database

Post by VladSun »

Or you can simply let MySQL do its work by using a proper DB design ;)
Create a unique constraint on all columns whose value combination must be unique.
E.g.
[sql]CREATE TABLE `contact` (
  `phone` varchar(11) NOT NULL,
  `name` varchar(11) NOT NULL,
  UNIQUE KEY `name_phone_pair` (`phone`,`name`)
) ENGINE=MyISAM[/sql]

Then simply import your CSV file without worrying about duplicates in a single simple query:
[sql]LOAD DATA LOCAL INFILE '/data.csv' INTO TABLE `contact`[/sql]

All duplicate records (according to unique constraints you have defined) will be skipped ;)
Last edited by VladSun on Thu Nov 13, 2008 4:10 pm, edited 1 time in total.
There are 10 types of people in this world, those who understand binary and those who don't
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Large Size Database

Post by Eran »

He wants to update existing records as well
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Large Size Database

Post by VladSun »

pytrin wrote:He wants to update existing records as well
[sql]LOAD DATA LOCAL INFILE '/data.csv' REPLACE INTO TABLE mytable[/sql]

:lol: :twisted: :twisted: :twisted:
There are 10 types of people in this world, those who understand binary and those who don't
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Large Size Database

Post by Eran »

That's pretty good :P

Does the primary key keep its value this way?
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Large Size Database

Post by VladSun »

pytrin wrote:That's pretty good :P

Does the primary key keep its value this way?
It's not typical in a CSV import for the primary key to be taken from the CSV data itself.
My tests showed that an auto_increment primary key `id` will change its value on every replace made on the row.
That's because REPLACE is indeed a DELETE followed by an INSERT.
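You can see it with a small test table (hypothetical names, with email as the unique key):

[sql]CREATE TABLE t (id INT AUTO_INCREMENT PRIMARY KEY, email VARCHAR(64) UNIQUE, lan CHAR(2));
INSERT INTO t (email, lan) VALUES ('a@example.com', 'en');  -- gets id 1
REPLACE INTO t (email, lan) VALUES ('a@example.com', 'fr'); -- row 1 is deleted, the new row gets id 2[/sql]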
There are 10 types of people in this world, those who understand binary and those who don't
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Large Size Database

Post by Eran »

Then it's not good for integrity... but maybe it's not needed
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Large Size Database

Post by VladSun »

pytrin wrote:Then it's not good for integrity... but maybe it's not needed
Integrity and CSV ... no such thing ;)
There are 10 types of people in this world, those who understand binary and those who don't
urieilam
Forum Newbie
Posts: 10
Joined: Sat Sep 09, 2006 1:24 pm

Re: Large Size Database

Post by urieilam »

Thanks for the suggestions. VladSun, I took your idea of making the email column unique. The problem is that there is another column I need to keep, i.e. not have overwritten, called "status" (it marks whether someone has unsubscribed). So, instead of using LOAD DATA, I'm using an INSERT query:

$query = "INSERT INTO addresses VALUES ('','$status','','','$email','$lan')
ON DUPLICATE KEY UPDATE lan='$lan'...";

It's already cut a lot of the time off, since everything runs in just one query. Do you know of a way I could use LOAD DATA and update only certain fields on duplicates?
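(As a sketch, the same statement can also take several rows at once, which should cut round trips further; the column names here follow the ones above, and VALUES(lan) refers to the value that row would have inserted:)

INSERT INTO addresses (status, email, lan) VALUES
  ('', 'a@example.com', 'en'),
  ('', 'b@example.com', 'fr')
ON DUPLICATE KEY UPDATE lan = VALUES(lan)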
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Large Size Database

Post by Eran »

If integrity is not important to you (i.e., you don't need the ID of the row to remain constant) you can follow Vlad's suggestion and use LOAD ... REPLACE:

Code: Select all

 
LOAD DATA LOCAL INFILE '/path/to/data.csv' REPLACE
INTO TABLE addresses
 
Alternatively you could load the CSV into a temporary table using LOAD INFILE, and then use one query to do all the inserting / updating:

Code: Select all

 
CREATE TEMPORARY TABLE csv_file (
     -- table definition here
);
LOAD DATA LOCAL INFILE '/path/to/data.csv' INTO TABLE csv_file;
INSERT INTO addresses (col1,col2,...) 
      SELECT col1,col2,... FROM csv_file 
      ON DUPLICATE KEY UPDATE col1=VALUES(col1),...;
 
 
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Large Size Database

Post by josh »

You can use a REPLACE instead of a SELECT & INSERT / UPDATE (which is not the same thing as the LOAD DATA syntax recommended above).

You should also run your queries through EXPLAIN and post each query and its EXPLAIN results here, so we can see if you're making any mistakes performance-wise.

You may also be overlooking some key normalization / de-normalization in your design that could resolve the bottleneck.

Have you benchmarked the CSV parsing separately from the database?
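For example (a sketch using the column names from earlier posts; note that REPLACE deletes and re-inserts the row, so columns you don't list are reset to their defaults and an auto_increment id changes):

Code: Select all

-- one statement instead of SELECT followed by INSERT or UPDATE
REPLACE INTO addresses (status, email, lan)
VALUES ('', 'a@example.com', 'en');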
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Large Size Database

Post by VladSun »

The main question (or at least the most important one to clarify) is: where does this CSV file come from, and how is it generated?
There are 10 types of people in this world, those who understand binary and those who don't