
Mysql Limitations

Posted: Tue Feb 12, 2008 2:06 am
by John Cartwright
I'm faced with the same situation I was in several months ago: tracking views and clicks of banners and links for promotional campaigns, all of which need very detailed tracking. Each link gets a tracking id which ties it into a much larger array of information (but that is not important here).

Last time I did this I very quickly ran into the default 4GB MyISAM table size limit and soon realized that MySQL may not necessarily be the best tool here. Performance quickly became an issue, especially with queries that performed calculations over sometimes ~50 million rows. Obviously one cannot expect a 50-million-row calculation to be fast, but I don't want to cripple the server either. The queries joined against 2-3 other small tables, so the queries themselves aren't very big.

Let's assume I can use any hardware configuration; realistically I have several servers at my disposal.

So, my initial questions about this massive problem of mine:

1) What kind of hardware configurations are recommended for high-volume database access?
2) What database is best suited for extremely large volumes of data (hundreds of millions of rows and up)?
3) What kind of impact do hundreds of inserts a second have on MySQL?

Before someone bashes me that this kind of data processing is not suitable for a web-based language: switching to an alternative language is an absolute last resort.

Re: Mysql Limitations

Posted: Tue Feb 12, 2008 3:28 am
by onion2k
MySQL should be fine, but you will need to optimise the way you insert and subsequently access the data.

First and foremost - it's a lot faster to do lots of inserts all at once on a MyISAM table. I would recommend creating a second HEAP table (HEAP is only held in memory) and have the web application insert into that table - it'll be very fast indeed. Then, every N minutes, run a script to flush all the data from the HEAP table into the MyISAM table with:

Code:

INSERT INTO `table_myisam` (`column1`,`column2`,`column3`) SELECT `column1`, `column2`, `column3` FROM `table_heap`;
Then clear the HEAP table with:

Code:

TRUNCATE `table_heap`;
If the server crashes you'll lose the contents of the HEAP table, which is an unfortunate side effect - but that could perhaps be alleviated by replicating to a second database server.
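A minimal sketch of the two-table setup described above (the table and column names here are illustrative, not from the original post):

```sql
-- In-memory staging table (the HEAP engine is also known as MEMORY);
-- the web application inserts each hit here.
CREATE TABLE `table_heap` (
    `tracking_id` INT UNSIGNED NOT NULL,
    `hit_time`    DATETIME     NOT NULL,
    `hit_type`    TINYINT      NOT NULL  -- e.g. 0 = view, 1 = click
) ENGINE=MEMORY;

-- Durable archive table with the same columns, flushed to every N minutes.
CREATE TABLE `table_myisam` (
    `tracking_id` INT UNSIGNED NOT NULL,
    `hit_time`    DATETIME     NOT NULL,
    `hit_type`    TINYINT      NOT NULL,
    KEY (`tracking_id`, `hit_time`)
) ENGINE=MyISAM;
```

One caveat with the flush script: rows inserted between the INSERT ... SELECT and the TRUNCATE would be silently lost, so the script should LOCK TABLES around the two statements (or TRUNCATE only up to a remembered boundary).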

Secondly, I wouldn't actually ever access the data in a 'live' situation. Run reports nightly when it's quiet.

Re: Mysql Limitations

Posted: Tue Feb 12, 2008 3:45 am
by Christopher
I think one question I would have is whether you need all of those logged rows once you have done your calculations. In the past I have done what webstats programs do, which is to reduce all the raw data records in a date range into subtotals and then do the actual stats on those. So you would run your queries on the raw data, then export it as a backup and delete those rows.
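The roll-up described above might look something like this (the summary table and column names are my own illustration):

```sql
-- Collapse raw hit rows into daily subtotals per tracking id,
-- then run the actual stats queries on the much smaller summary table.
INSERT INTO `daily_totals` (`tracking_id`, `day`, `views`, `clicks`)
SELECT `tracking_id`,
       DATE(`hit_time`),
       SUM(`hit_type` = 0),   -- MySQL treats boolean expressions as 0/1
       SUM(`hit_type` = 1)
FROM `raw_hits`
WHERE `hit_time` < CURDATE()
GROUP BY `tracking_id`, DATE(`hit_time`);

-- After exporting the raw rows as a backup, delete them.
DELETE FROM `raw_hits` WHERE `hit_time` < CURDATE();
```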

Re: Mysql Limitations

Posted: Tue Feb 12, 2008 12:07 pm
by liljester
Have you looked into creating a MySQL Cluster?

Re: Mysql Limitations

Posted: Tue Feb 12, 2008 1:22 pm
by Benjamin
You may want to look into running select queries off slave servers to reduce the load. Memcache may help depending on how fresh certain data needs to be. Finally it may be good just to track all the data and run a nightly cron job to generate all the reports.

Re: Mysql Limitations

Posted: Tue Feb 12, 2008 7:49 pm
by John Cartwright
onion2k wrote:MySQL should be fine, but you will need to optimise the way you insert and subsequently access the data.

First and foremost - it's a lot faster to do lots of inserts all at once on a MyISAM table. I would recommend creating a second HEAP table (HEAP is only held in memory) and have the web application insert into that table - it'll be very fast indeed. Then, every N minutes, run a script to flush all the data from the HEAP table into the MyISAM table with:

Code:

INSERT INTO `table_myisam` (`column1`,`column2`,`column3`) SELECT `column1`, `column2`, `column3` FROM `table_heap`;
Then clear the HEAP table with:

Code:

TRUNCATE `table_heap`;
If the server crashes you'll lose the contents of the HEAP table, which is an unfortunate side effect - but that could perhaps be alleviated by replicating to a second database server.
Excellent idea!
Secondly, I wouldn't actually ever access the data in a 'live' situation. Run reports nightly when it's quiet.
The best solution I can think of is grouping records with the same tracking id into hourly slots, then simply incrementing that record by 1. The reason I can't go for an all-out reporting solution is that the data needs to stay transparent as you drill down into more detail.

I.e. initially you will see a list of sites, with all the stats grouped by site;
the second level is grouped by affiliate for each site;
the third level is the campaigns being run by that affiliate;
the fourth level is the complete data breakdown of the campaign.

My boss said the lowest we can go is hourly builds, however I am still interested in reducing the amount of data being stored further.
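A sketch of that hourly-slot approach (the schema here is my own illustration): with a primary key over the tracking id and the hour, each hit becomes a single statement that either creates the slot or increments it in place:

```sql
CREATE TABLE `hourly_stats` (
    `tracking_id` INT UNSIGNED NOT NULL,
    `hour_slot`   DATETIME     NOT NULL,  -- hit time truncated to the hour
    `views`       INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (`tracking_id`, `hour_slot`)
);

-- One row per tracking id per hour, incremented on every subsequent hit.
INSERT INTO `hourly_stats` (`tracking_id`, `hour_slot`, `views`)
VALUES (42, '2008-02-12 19:00:00', 1)
ON DUPLICATE KEY UPDATE `views` = `views` + 1;
```

This keeps storage proportional to distinct (tracking id, hour) pairs rather than to raw hits, while still supporting the drill-down levels described above.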
arborint wrote:I think one question I would have is whether you need all of those logged rows once you have done your calculations. In the past I have done what webstats programs do, which is to reduce all the raw data records in a date range into subtotals and then do the actual stats on those. So you would run your queries on the raw data, then export it as a backup and delete those rows.
Indeed, I want to move away from raw data; the last system we had held about 120 million rows of data (which were eventually corrupted and lost :banghead:).
Have you looked into creating a MySQL Cluster?
Certainly :)
You may want to look into running select queries off slave servers to reduce the load. Memcache may help depending on how fresh certain data needs to be. Finally it may be good just to track all the data and run a nightly cron job to generate all the reports.
Would it be advisable to use memcache with large data sets? I think our servers have 16GB of memory at their disposal.

Re: Mysql Limitations

Posted: Tue Feb 12, 2008 7:55 pm
by Benjamin
I believe memcache is typically used to cache the results returned by resource intensive or slow queries to reduce the load on MySQL servers.

Re: Mysql Limitations

Posted: Wed Feb 13, 2008 6:08 pm
by Weirdan
onion2k wrote: First and foremost - it's a lot faster to do lots of inserts all at once on a MyISAM table. I would recommend creating a second HEAP table (HEAP is only held in memory) and have the web application insert into that table - it'll be very fast indeed. Then, every N minutes, run a script to flush all the data from the HEAP table into the MyISAM
A similar effect could be achieved with INSERT DELAYED - this hands the query to a dedicated MySQL thread, which executes it when the server has spare resources.
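For illustration (the table and column names are assumptions, not from the thread):

```sql
-- The client gets control back immediately; the row is queued and
-- written by a handler thread when the table is not in use.
-- Note: DELAYED works only with MyISAM, MEMORY and ARCHIVE tables,
-- and queued rows are lost if the server crashes before they are written.
INSERT DELAYED INTO `raw_hits` (`tracking_id`, `hit_time`, `hit_type`)
VALUES (42, NOW(), 1);
```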
The best solution I can think of is grouping records with the same tracking id into hour slots, then simply updating that record + 1.
This is a good idea too - I used it four years ago and was able to record about 70 hits per second on pretty mediocre hardware. Using a storage engine that supports row-level locks (InnoDB) might also help - otherwise long-running selects (reports, data transformation) will block updates on the table.
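Switching an existing counters table to InnoDB is a one-statement change (the table name is illustrative):

```sql
-- MyISAM takes a whole-table lock per statement; InnoDB locks only the
-- rows a statement touches, so reporting SELECTs no longer block the
-- steady stream of counter UPDATEs (and vice versa).
ALTER TABLE `hourly_stats` ENGINE=InnoDB;
```

The trade-off is somewhat slower bulk inserts and more disk usage, so it is worth benchmarking against the actual write load.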

Re: Mysql Limitations

Posted: Thu Feb 14, 2008 8:12 am
by liljester
I'm very curious to see which path you decide to take - let us know how it turns out!