Help optimizing query

Questions about MySQL, PostgreSQL, and most other databases, as well as using them with PHP, can be asked here.

Moderator: General Moderators

Post Reply
User avatar
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W
Contact:

Help optimizing query

Post by pickle »

Hi everyone,

I've got a killer of a query here that is taking about 37 seconds to return results. I was wondering if anyone knew of any way I could make it quicker.

Here's the query:

Code: Select all

SELECT
    cc.category,
    (sum(hrs_spent) + floor(sum(min_spent)/60)) as hrs
FROM
    assigned_challenge_categories as acc,
    challenge_categories as cc,
    followups as f
WHERE
    cc.id = acc.category AND
    f.challenge_id = acc.challenge_id
GROUP BY
    category
ORDER BY
    hrs DESC
LIMIT 
    10
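For anyone reading along, here is the same query in explicit JOIN syntax (semantically identical to the comma-join form above; just easier to follow when checking the plan with EXPLAIN):

```sql
-- The same query using explicit JOIN syntax (a sketch; semantics are
-- identical to the comma-join version above).
SELECT
    cc.category,
    SUM(f.hrs_spent) + FLOOR(SUM(f.min_spent) / 60) AS hrs
FROM assigned_challenge_categories AS acc
JOIN challenge_categories AS cc ON cc.id = acc.category
JOIN followups AS f ON f.challenge_id = acc.challenge_id
GROUP BY cc.category
ORDER BY hrs DESC
LIMIT 10;
```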
And the relevant table structure:

Code: Select all

desc assigned_challenge_categories;
+--------------+---------+------+-----+---------+----------------+
| Field        | Type    | Null | Key | Default | Extra          |
+--------------+---------+------+-----+---------+----------------+
| id           | int(11) |      | PRI | NULL    | auto_increment |
| challenge_id | int(11) |      |     | 0       |                |
| followup_id  | int(11) |      |     | 0       |                |
| category     | int(11) |      |     | 0       |                |
+--------------+---------+------+-----+---------+----------------+

desc challenge_categories;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| id       | int(11)      |      | PRI | 0       |       |
| category | varchar(255) |      |     |         |       |
+----------+--------------+------+-----+---------+-------+

desc followups;
+--------------+-------------+------+-----+---------+----------------+
| Field        | Type        | Null | Key | Default | Extra          |
+--------------+-------------+------+-----+---------+----------------+
| id           | int(11)     |      | PRI | NULL    | auto_increment |
| challenge_id | int(11)     |      |     | 0       |                |
| date         | int(11)     |      |     | 0       |                |
| author       | varchar(15) |      | MUL |         |                |
| contents     | text        | YES  |     | NULL    |                |
| hrs_spent    | tinyint(3)  | YES  |     | NULL    |                |
| min_spent    | tinyint(3)  | YES  |     | NULL    |                |
+--------------+-------------+------+-----+---------+----------------+
Basically, I want to get a count of all the time spent in each assigned category, then return the top 10 most used categories. Here's the type of results I want:

Code: Select all

+------------------------------------+-------+
| category                           | hrs   |
+------------------------------------+-------+
| Dept                               | 75068 |
| Dept-Task                          | 72023 |
| Software-Server-Programming        | 13114 |
| Software-Server-Operating System   |  1732 |
| Default                            |  1694 |
| Hardware-PC                        |  1564 |
| Software-Server                    |  1332 |
| Software-Website Programming       |  1129 |
| Hardware-Audio/Visual              |  1107 |
| Software-Workstation-Configuration |  1060 |
+------------------------------------+-------+
Initially, I thought it may have been all the summing I was doing, but taking that calculation out didn't speed up the query at all.

Thanks for any and all help!


[Edit:]
I should add that both the followups and assigned_challenge_categories tables have about 8000 rows each, while challenge_categories has 88.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
ody
Forum Contributor
Posts: 147
Joined: Sat Mar 27, 2004 4:42 am
Location: ManchesterUK

Post by ody »

How many rows of data are we talking about in each table? What DBMS are you using? What table engine are you using? Have you thought about running the query once and keeping a master table up to date with the stats via triggers?
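A rough sketch of that trigger idea, assuming MySQL 5.0 or later (which added triggers); the summary table name `category_minutes` and the trigger name are illustrative, not anything from pickle's schema:

```sql
-- Hypothetical summary table, kept up to date as followups arrive.
CREATE TABLE category_minutes (
    category  INT PRIMARY KEY,
    total_min INT NOT NULL DEFAULT 0
);

-- On each new followup, add its time to every category assigned to
-- that challenge (IFNULL guards the nullable hrs/min columns).
CREATE TRIGGER followups_ai AFTER INSERT ON followups
FOR EACH ROW
    UPDATE category_minutes cm
    JOIN assigned_challenge_categories acc ON acc.category = cm.category
    SET cm.total_min = cm.total_min
                     + IFNULL(NEW.hrs_spent, 0) * 60
                     + IFNULL(NEW.min_spent, 0)
    WHERE acc.challenge_id = NEW.challenge_id;
```

The top-10 query then becomes a cheap read against the small summary table instead of an aggregate over all the followups.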
timvw
DevNet Master
Posts: 4897
Joined: Mon Jan 19, 2004 11:11 pm
Location: Leuven, Belgium

Post by timvw »

You could add a column hrs... and give it the value (sum(hrs_spent) + floor(sum(min_spent)/60))...

This way you add a little documented redundancy, but I have a good feeling it will speed up your queries (because you don't have to calculate the hrs for 8000 rows anymore...)

If that is not enough, you could also add an index on hrs, and see if that speeds things up even more...
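One reading of this suggestion, as a sketch (it assumes you can alter the schema; the column name `total_min` is illustrative): store a precomputed per-row minute total so the query only has to sum one column instead of combining two.

```sql
-- Hypothetical precomputed column: total minutes per followup row.
ALTER TABLE followups ADD COLUMN total_min INT;
UPDATE followups
   SET total_min = IFNULL(hrs_spent, 0) * 60 + IFNULL(min_spent, 0);

-- The aggregate in the original query then simplifies to:
--   FLOOR(SUM(f.total_min) / 60) AS hrs
-- which is arithmetically the same as sum(hrs) + floor(sum(min)/60).
```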
Roja
Tutorials Group
Posts: 2692
Joined: Sun Jan 04, 2004 10:30 pm

Post by Roja »

Change the select to directly select hrs_spent and min_spent, and then have PHP do the math on the selected rows.

PHP's math processing is much faster than SQL's, and by doing post-processing, you are only doing it on the selected rows, instead of *all* rows. It should net a considerable time improvement.

Then look at explain on the select, and see if there are any index improvements possible.
ody
Forum Contributor
Posts: 147
Joined: Sat Mar 27, 2004 4:42 am
Location: ManchesterUK

Post by ody »

Roja wrote:Change the select to directly select hrs_spent and min_spent, and then have PHP do the math on the selected rows.

PHP's math processing is much faster than SQL's, and by doing post-processing, you are only doing it on the selected rows, instead of *all* rows. It should net a considerable time improvement.

Then look at explain on the select, and see if there are any index improvements possible.
I'm very interested in your statement that PHP will outperform MySQL in math processing. This may be true at a raw level, but when selecting a lot of data from a MySQL database from PHP, the time delay will negate any speed increase. For example:

Code: Select all

<?php
/*
create table test(val int);
*/

/*
mysql> select sum(val) from test;
+-------------+
| sum(val)    |
+-------------+
| 49915709084 |
+-------------+
1 row in set (0.02 sec)
*/


mysql_connect('localhost', 'root', '');
mysql_select_db('test');

/* use this to populate the table
for ($x = 0; $x < 100000; $x++) {
  mysql_query("insert into test(val) values('".rand(0, 1000000)."')");
}
*/


/*
[root@beech html]# time php pc.php

real 0m0.632s
user 0m0.360s
sys 0m0.250s
*/

$sum = 0;
$result = mysql_query("select val from test");
while ($row = mysql_fetch_array($result)) {
  $sum += $row[0];
}
?>
A quick and dirty example of what I mean; increase the number of rows to 1 million and see what happens:

Code: Select all

mysql> select sum(val) from test;
+--------------+
| sum(val)     |
+--------------+
| 499757623409 |
+--------------+
1 row in set (0.18 sec)


[root@beech html]# time php pc.php 

real 0m5.923s
user 0m3.360s
sys 0m2.430s
You are right in saying it's quicker to post-process data with PHP if you have a very limited result set; however, things start getting slow very quickly when trying to perform mathematical operations in PHP on data from MySQL.
Roja
Tutorials Group
Posts: 2692
Joined: Sun Jan 04, 2004 10:30 pm

Post by Roja »

ody wrote: You are right in saying it's quicker to post-process data with PHP if you have a very limited result set; however, things start getting slow very quickly when trying to perform mathematical operations in PHP on data from MySQL.
No, you are creating a loop that makes the math reliant upon the SQL call, limiting PHP's speed to that of MySQL's.

A better test is to pull all the data and, *after* pulling the data, perform the post-processing (not in the same loop where you pull it). Then the data pull is separate from the processing, and you can see the speed of each separately, and without the constraint.

As a sidenote, when pulling large quantities of data, you should generally make the loop as "tight" as possible: do as little as possible in the loop so that you reduce the impact of multiple calls to the db by reducing any delay between calls. On Postgres, putting multiple calls in a transaction can result in a substantial speed improvement.
ody
Forum Contributor
Posts: 147
Joined: Sat Mar 27, 2004 4:42 am
Location: ManchesterUK

Post by ody »

Roja wrote: A better test is to pull all the data and, *after* pulling the data, perform the post-processing (not in the same loop where you pull it). Then the data pull is separate from the processing, and you can see the speed of each separately, and without the constraint.

Code: Select all

$vals = array();
$result = mysql_query("select val from test");
while ($row = mysql_fetch_array($result)) {
  array_push($vals, $row[0]);
}

$time_start = time();
$sum = 0;
foreach ($vals as $val) $sum += $val;
echo time() - $time_start . "\n";
This takes at least 2 seconds to process, not to mention 7 seconds to "pull" the data. Is this what you are getting at? If not, can you stick some code together to explain it? :P

cheers.
User avatar
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W
Contact:

Post by pickle »

Wow! So many replies over the weekend (Canadian = just had a long weekend).

Somehow, the query is now only taking 14 seconds. Going from 37 seconds to 14 is pretty good for not doing anything! Actually, I have reason to believe it was the server itself that was taking so long, but that's beside the point (I think).

Ok, now with the replies:

~ody:
- We're currently talking about 8000 rows in two of the tables and 88 in the third. The 8000-row tables are only going to get bigger (probably by about 10 or 20 rows a day).
- The tables are MyISAM
- I haven't thought about it because I would like these queries to be up-to-the-second accurate (and they're not SO mission critical that they need to be really snappy - I'd just like them to be)

~timvw:
- I'm unsure how I would be able to add an hrs column, seeing as that value in my query is the sum of a bunch of rows from the followups table.
- I added the index and it did absolutely nothing :)

~Roja:
- The sum(.... line only takes about 3 seconds to do. Would PHP really be able to read through 8000 lines of a result set, sum up all the minutes and hours spent, sort an 80 element array and give me the top 10 in less than 3 seconds? I've always been a believer that anything you can get MySQL to do that will eliminate post-processing, do it.




This just in... I did some reading on what an index does and added some more. I added an index on the category columns of both the assigned_challenge_categories and challenge_categories tables, as well as the challenge_id columns of the followups and assigned_challenge_categories tables. This has sped up the query to 2 or 3 seconds. Not BY 2 or 3 seconds; the query now takes 2 or 3 seconds to complete. Man, I should use indices more often.
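For reference, the indexes described above would look roughly like this (the index names are illustrative):

```sql
CREATE INDEX idx_acc_category  ON assigned_challenge_categories (category);
CREATE INDEX idx_acc_challenge ON assigned_challenge_categories (challenge_id);
CREATE INDEX idx_cc_category   ON challenge_categories (category);
CREATE INDEX idx_f_challenge   ON followups (challenge_id);
```

With these in place, EXPLAIN should show the joins doing ref lookups on the indexes instead of full table scans.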


Thanks for all your help everyone, much appreciated.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
Roja
Tutorials Group
Posts: 2692
Joined: Sun Jan 04, 2004 10:30 pm

Post by Roja »

pickle wrote: - The sum(.... line only takes about 3 seconds to do.
Interesting. I put more faith in the indexes idea, but I usually find sql math to be a slow point as well. Every situation is different, so..
pickle wrote: Would PHP really be able to read through 8000 lines of a result set, sum up all the minutes and hours spent, sort an 80 element array and give me the top 10 in less than 3 seconds?
Obviously that varies tremendously depending on the situation, server, and so on, and your mileage may vary.
pickle wrote:I've always been a believer that anything you can get MySQL to do that will eliminate post-processing, do it.
It depends. Some processing is better for MySQL to do, and some is better for PHP. I find that sorting and limiting seem to be better in SQL, and math is better in PHP. That's just my experience, and it's over a very specific set of scenarios (usually web-based games). Your mileage may vary, but I've found solid successes in moving math from SQL to PHP.
pickle wrote:This has sped up the query to 2 or 3 seconds. Not BY 2 or 3 seconds, the query now takes 2 or 3 seconds to complete. Man, I should use indices more often.
Indexes are truly the jedi mind trick of SQL. Those that master it are truly powerful.
ody
Forum Contributor
Posts: 147
Joined: Sat Mar 27, 2004 4:42 am
Location: ManchesterUK

Post by ody »

Pickle, if you want to try to squeeze more speed out of this, I recommend MySQL's HEAP storage engine: http://dev.mysql.com/doc/mysql/en/memor ... ngine.html. That, combined with indexes, will give you greased-lightning speeds :P
User avatar
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W
Contact:

Post by pickle »

True they'll be faster, but:

Code: Select all

MEMORY tables are stored in memory and use hash indexes by default. 
This makes them very fast, and very useful for creating temporary tables. 
However, when the server shuts down, all data stored in MEMORY  tables is lost.
Key words being 'temporary tables' and 'all data... is lost'.

That won't work well for me, since these are three of the most-used tables in this system.

Thanks for the thought, but I'll stick to tables actually written to disk (which aren't causing me that much of a problem now anyway, since I added the indices :))
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
ody
Forum Contributor
Posts: 147
Joined: Sat Mar 27, 2004 4:42 am
Location: ManchesterUK

Post by ody »

What about the NDBCLUSTER engine?
Post Reply