Page 1 of 2
counting rows vs. stored # in database
Posted: Sat Feb 04, 2006 4:01 pm
by s.dot
So, say a database table has a few hundred thousand records and is always getting written to and deleted from. The number of rows is always fluctuating.
What's the best way to count rows, with a given where clause.
IE: SELECT count(*) FROM `table` WHERE `foo` = '$bar'
That will return a row count, but what about cpu usage scrambling through those hundreds of thousands of rows? Would it be less cpu intensive to store a count in a field in a database table and increment/decrement that field?
So I guess it's a debate over cpu usage counting rows, or cpu usage updating a count field. Which would be the best way to go?
Posted: Sat Feb 04, 2006 5:56 pm
by dbevfat
I think this could be called premature optimization. Besides, it complicates things somewhat. Stick to count() until you're sure this is the bottleneck worth optimizing. You do that with profiling:
http://c2.com/cgi/wiki?ProfileBeforeOptimizing. For profiling, I use XDebug and KCacheGrind.
See also:
First rule of optimization:
http://c2.com/cgi/wiki?FirstRuleOfOptimization
Second rule of optimization:
http://c2.com/cgi/wiki?SecondRuleOfOptimization
Btw; did you run this COUNT on the table a few times? How long did it take? I doubt the execution time is critical

Posted: Sat Feb 04, 2006 6:11 pm
by John Cartwright
I've already run the benchmarks on this in another thread. I cannot find it though..
I ran it on a database with about 10, 000 rows and if I remember corrected I only queried
vs.
which resulted in COUNT(*) being faster than fetching a single row.. but once you throw in the WHERE clause, `post_count` blows COUNT(*) out of the water..
Re: counting rows vs. stored # in database
Posted: Sat Feb 04, 2006 6:32 pm
by Christopher
scrotaye wrote:That will return a row count, but what about cpu usage scrambling through those hundreds of thousands of rows? Would it be less cpu intensive to store a count in a field in a database table and increment/decrement that field?
So I guess it's a debate over cpu usage counting rows, or cpu usage updating a count field. Which would be the best way to go?
I agree that this is premature implementation. But the larger point is that the goal of removing dependencies to your Model/Domain layer is that is shouldn't matter to your application which one is used. Start with COUNT(*) and migrate to a different solution as needed.
I have used stored values, but the problem with them is that you absolutely have to use transactions to implement them or they become incorrect quickly. That's a performance hit and then you need to so regular reality check syncs to compare acutal vs calculated. You'd be better off putting your energies into tuning your database and the table schema than heading down that road.
Posted: Sat Feb 04, 2006 6:37 pm
by s.dot
Actually, that's 100% correct, this IS preoptimization.
But I've ran several websites in the past (of which JshPro2 helped me optimize) where counting rows ran the server load up to 70 (whoa!), but when we stored the post count as opposed to counting rows it brought it down to under 2.
With the current website I'm working on, I expect a high amount of volume in the future and I want to take care of this now rather than down the road when it needs it.
So I guess I could further refine my question... with proper indexing of database tables, is there a better method than SELECT count(*) (or anything that could improve it), or would storing a number be the best option, with of course some scripts to ensure accuracy from time to time?
Posted: Sat Feb 04, 2006 7:11 pm
by Christopher
scrotaye wrote:So I guess I could further refine my question... with proper indexing of database tables, is there a better method than SELECT count(*) (or anything that could improve it), or would storing a number be the best option, with of course some scripts to ensure accuracy from time to time?
I know of no generally faster way than COUNT(*), but other methods are faster in different cases.
Posted: Sat Feb 04, 2006 7:34 pm
by josh
Scrotaye, the reason your server load was 70 is because of the complex nature of your where clause and the large size of the table, the table was over 15 million rows if I remember and you had absolutely no indexing at all. When I added an index on the fields the server load immediately dropped to under 10, just like that. Since your row calculation was only increasing we could run a
Code: Select all
UPDATE delayed `users` SET `friends` = `friends` + 1
Which tells mysql to take it's time updating the friends count, in most cases this will not result in more then 2-3 second delays in updating the counts. (meaing the query runs 2-3 seconds from the time it is called, not it takes 2-3 seconds to run)
As a general rule if you are selecting something like a post count the stored value is always going to be faster, simply because reading the number one thousand out of a record is going to kill in terms of performance counting one thousand rows one by one. The exception would be a myISAM table, because they store the total row count, innoDB does not because of problems related to the transactions I think. Also as a general rule if you have a field in the where clause and that query is running often, your site is going to crawl without an index on the field.
I would also say this is not over-optimizing. Lets say that a forums page is displaying 10 posts per page, all the posters on that page have post counts of 1,500 or more. You are looking at 15,000 rows.. every single time that page loads (I think mysql caches results of queries but that's no good because posts are continually happening so the counts dont get cached for more then a few seconds at a time). So you are looking at counting 15,000+ rows vs. reading from 10 rows, also you are most likely selecting from your users table in the first place to get get the username from the userid (your database is normalized, isn't it?), I would say the decision to go with stored post count would be a very smart one considering these facts.
Posted: Sat Feb 04, 2006 7:39 pm
by Christopher
jshpro2 wrote:I would say the decision to go with stored post count would be a very smart one considering these facts.
I would agree that would be smart solution for that problem. And because only the user creating a post will update their record there is not need for transactions or anything fancy.
Posted: Sat Feb 04, 2006 7:43 pm
by printf
SELECT column_name if you need something, if you don't use COUNT is always faster, even using WHERE as long as you INDEX what your asking after the WHERE. I mean that's a basic optimization rule!
pif
Posted: Sat Feb 04, 2006 9:04 pm
by josh
printf wrote:SELECT column_name if you need something, if you don't use COUNT is always faster
Hmm, I think you missed a word in there or something (honest mistake), but do you care to elaborate on what you mean by "if you need something" and "if you don't use COUNT is always faster"?
Posted: Sat Feb 04, 2006 10:10 pm
by Christopher
jshpro2 wrote:Hmm, I think you missed a word in there or something (honest mistake), but do you care to elaborate on what you mean by "if you need something" and "if you don't use COUNT is always faster"?
I've had this converstation before and a number of guys have verified that "SELECT COUNT(*)" is generally (but not in every case) faster that "SELECT field". I believe the reason is that COUNT(*) allows the server to pick what it believes is the optimal way to perform the query. As no data is returned, any tricks the particular DB engine has can be used. I have seen results using MySQL and Postgres that confirm this.
Posted: Sat Feb 04, 2006 10:16 pm
by John Cartwright
arborint wrote:jshpro2 wrote:Hmm, I think you missed a word in there or something (honest mistake), but do you care to elaborate on what you mean by "if you need something" and "if you don't use COUNT is always faster"?
I've had this converstation before and a number of guys have verified that "SELECT COUNT(*)" is generally (but not in every case) faster that "SELECT field". I believe the reason is that COUNT(*) allows the server to pick what it believes is the optimal way to perform the query. As no data is returned, any tricks the particular DB engine has can be used. I have seen results using MySQL and Postgres that confirm this.
My result differed from this..
I will setup a thorough benchmark in several situations hopefully tommorow since I'm about to hit the hay
Posted: Sat Feb 04, 2006 11:36 pm
by josh
Stored count vs. live counting (50,000 repetitions)
Average time per counting
Stored: 0.00011648024082184
Count: 0.00013579289913177
Total time
Stored total: 5.8240120410919
Count total: 6.7896449565887
Code: Select all
// yes, this is misspelled, too lazy to fix
$repititions=50000;
/*
mysql_query(
'
CREATE TABLE `temp_table` (
`user` INT( 5 ) NOT NULL ,
INDEX ( `user` )
);
'
) or die(mysql_error());
mysql_query(
'
CREATE TABLE `temp_users` (
`id` INT(8) NOT NULL,
`user` INT( 8 ) NOT NULL ,
`count` INT (5) NOT NULL,
INDEX ( `id` )
);
'
) or die(mysql_error());*/
/*
mysql_query('insert into `temp_users` (`id`,`user`,`count`) values (1,\'jshpro2\',2000)');
for($i=0;$i<=5;$i++) {
for($ii=0;$ii<=2000;$ii++) {
mysql_query('insert into `temp_table` (`user`) values ('.(int)$ii.')') or die(mysql_error());
}
}
*/
$start = microtime(1);
for($i=0;$i<$repititions;$i++) {
$result = mysql_query(
sprintf(
"
SELECT count(*) from `temp_table` WHERE `user` = 1
"
)
) or die(mysql_error());
$count=mysql_result($result,0,0);
mysql_free_result($result);
}
$total_count = microtime(1)-$start;
$start = microtime(1);
for($i=0;$i<$repititions;$i++) {
$result = mysql_query(
sprintf(
"
SELECT `count` from `temp_users` WHERE `id` = 1 LIMIT 1
"
)
) or die(mysql_error());
$count=mysql_result($result,0,0);
mysql_free_result($result);
}
$total_stored = microtime(1)-$start;
echo $repititions;
?>
<br />
Stored: <?=$total_stored/$repititions?><br />
Count: <?=$total_count/$repititions?>
<hr />
Stored total: <?=$total_stored?><br />
Count total: <?=$total_count?>
Give yourself a larger `temp_table` with the rows not already grouped by the user like I have them, and run the code 10 times per page and it could amount to a large difference in execution time when you're talking about a live production server.
Now what if you needed to list the top 5 posters, you're going to scan the entire `temp_table` table in some really nonsense join statement? Tell me that's not easier [and sane] then just ordering by `count` on the `temp_users` table!
Posted: Sun Feb 05, 2006 12:00 am
by d3ad1ysp0rk
It may be faster, but in any real application it would be pointless to have twice the amount of tables for everything.
Having a table `users` that stores data (id,name,pass,etc) and a `users_c` table that stores the amount of users in it seems kinda pointless to me. That's just one more table to deal with, one more query to execute when changing any table, and one more confusing piece of code for new programmers that have never seen the technique before.
Posted: Sun Feb 05, 2006 12:16 am
by feyd
generally, keeping a user count stored is yes pointless, but a user's post count is not. There are many reasons to keep a post count, and not the user count. Here's a few: your forum supports archival or pruning, thereby removing posts from the pool of those accessible. Under this circumstance, your users will likely demand that their post count should not go down just because they have an "old" post. The users list varies often enough, and is needed infrequently enough that performing a count is relatively painless. One any given thread page, if you had to dynamically calculate the post count for each user, the page could take a large amount of time to render simply due to counting, whereas it being apart of the user record, is easily found.
Having a service in the administration panels allowing an admin to resync the post count of a user or all users can be helpful, however it may hint that your script is weak enough that it may lose track of post counts.