large query going very slow

Posted: Wed May 02, 2007 9:32 pm
by John Cartwright
In a nutshell, I'm developing some statistics to track our users based upon their clicks and impressions. The table backing this has gotten extremely large and is growing very rapidly: it is approaching 30 million rows, and the query below has become excessively slow.

Code:

SELECT COUNT(`click_views`.`id`) AS `clickviews`, `view_type`, `sites`.`id`, `site_name`
FROM `click_views` 
INNER JOIN `sites` ON `click_views`.`site_id` = `sites`.`id`
WHERE `click_views`.`af_id` > 0
AND `click_views`.`view_type` != ''
GROUP BY `sites`.`id`, `view_type`
Working on my local server with a reduced amount of data, approximately 5 million rows, the query takes about 10 seconds. With the full 30 million rows, it takes about 300 seconds.

Here is what EXPLAIN shows for the query:

Code:

id  select_type  table        type  possible_keys            key      key_len  ref               rows    Extra
1   SIMPLE       sites        ALL   PRIMARY                  NULL     NULL     NULL              10      Using temporary; Using filesort
1   SIMPLE       click_views  ref   site_id,view_type,af_id  site_id  1        spiders.sites.id  316368  Using where
and here is the structure of the click_views table

Code:

CREATE TABLE `click_views` (
  `id` int(11) NOT NULL auto_increment,
  `ip` varchar(16) collate utf8_unicode_ci default NULL,
  `af_id` int(11) NOT NULL default '0',
  `campaign_id` int(11) NOT NULL default '0',
  `url_id` int(11) default NULL,
  `date` datetime NOT NULL default '0000-00-00 00:00:00',
  `banner_id` mediumint(9) NOT NULL default '0',
  `planID` mediumint(9) NOT NULL default '0',
  `view_type` enum('v','c') collate utf8_unicode_ci NOT NULL default 'v',
  `site_id` tinyint(4) NOT NULL default '0',
  `redirect_id` int(11) NOT NULL default '0',
  `sub` varchar(55) collate utf8_unicode_ci NOT NULL default 'N/A',
  `galleryname` varchar(24) collate utf8_unicode_ci NOT NULL default '',
  PRIMARY KEY  (`id`),
  KEY `redirect_id` (`redirect_id`),
  KEY `ip` (`ip`),
  KEY `site_id` (`site_id`),
  KEY `campaign_id` (`campaign_id`),
  KEY `view_type` (`view_type`),
  KEY `date` (`date`),
  KEY `af_id` (`af_id`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=2214611 ;
and here's the sites table

Code:

CREATE TABLE `sites` (
  `id` int(5) NOT NULL auto_increment,
  `site_name` varchar(15) collate utf8_unicode_ci NOT NULL default '',
  `commission` varchar(255) collate utf8_unicode_ci NOT NULL default '',
  `active` tinyint(1) default '1',
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=11 ;
Now, I am no database guru, but for the life of me I cannot figure out a way to make this query run fast. Am I being unrealistic in wanting to COUNT() 30 million rows quickly?

Posted: Thu May 03, 2007 3:09 am
by onion2k
I'm not 100% sure, but creating an index on `click_views`.`site_id` should help, as that's what it's joining on in the second row of the EXPLAIN. Possibly a two-column index on `click_views`.`id` and `click_views`.`site_id`?

Posted: Thu May 03, 2007 11:11 am
by John Cartwright
Those columns have already been set as keys.
:(

Posted: Thu May 03, 2007 4:02 pm
by onion2k
When did you last run "OPTIMIZE TABLE `click_views`"? If the answer is 'never', run it.

Posted: Thu May 03, 2007 5:33 pm
by pickle
300 seconds = 5 minutes. For 30 million rows that doesn't sound all that long. I don't know how old this information is, but I've heard that MySQL has trouble joining tables when at least one of them is > 1 million rows.

If you've got everything indexed, I'm not sure what else you can do. I've noticed that for some tables, it's actually faster to do the join logic in code rather than in MySQL. I can't explain why, but when I changed some of my queries, I saw a > 200% speed up.

Posted: Fri May 04, 2007 1:44 am
by AKA Panama Jack
You are pretty much screwed even with indexing: the GROUP BY coupled with the INNER JOIN you are using means the query is going to be doing table scans, and performing table scans of a 30-million-record table is going to be very slow and CPU intensive.

Posted: Fri May 04, 2007 1:42 pm
by Weirdan
According to the MySQL manual:
The most important preconditions for using indexes for GROUP BY are that all GROUP BY columns reference attributes from the same index, and that the index stores its keys in order (for example, this is a BTREE index and not a HASH index). Whether use of temporary tables can be replaced by index access also depends on which parts of an index are used in a query, the conditions specified for these parts, and the selected aggregate functions.
Your best bet would be to group by `click_views`.`site_id`, `view_type` (not `sites`.`id`!) and add an index over those two columns. Using COUNT(*) instead of COUNT(`click_views`.`id`) may help too.
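
A sketch of that suggestion, assuming the grouping column is meant to be `click_views`.`site_id` (the index name `site_view` is made up for illustration):

Code:

-- Hypothetical composite BTREE index covering both GROUP BY columns
ALTER TABLE `click_views` ADD INDEX `site_view` (`site_id`, `view_type`);

-- The query rewritten to group on the indexed columns and use COUNT(*)
SELECT COUNT(*) AS `clickviews`, `view_type`, `sites`.`id`, `site_name`
FROM `click_views`
INNER JOIN `sites` ON `click_views`.`site_id` = `sites`.`id`
WHERE `click_views`.`af_id` > 0
AND `click_views`.`view_type` != ''
GROUP BY `click_views`.`site_id`, `view_type`;

Whether MySQL can actually avoid the temporary table here also depends on the WHERE conditions (the `af_id` filter is not covered by this index), so check EXPLAIN again after adding it.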

Posted: Fri May 04, 2007 1:48 pm
by John Cartwright
Thanks for the input, guys. I'll let you know how it goes as soon as I get some positive results :)

Posted: Fri May 04, 2007 5:08 pm
by RobertGonzalez
What about changing the table type to InnoDB from MyISAM? Would that help?

Posted: Fri May 04, 2007 5:25 pm
by AKA Panama Jack
Everah wrote: What about changing the table type to InnoDB from MyISAM? Would that help?
That probably depends on how much memory he has available and whether it is a dedicated server. InnoDB can be faster, but that is because it caches table data as well as indexes in memory, whereas MyISAM keeps only the indexes in memory. With 30 million records that may not be possible.

Posted: Fri May 04, 2007 5:53 pm
by John Cartwright
Just so you know, we are running a dedicated server.

I'm just finishing up the project as-is for the time being, without making any modifications to the database itself, as there are many other sites reliant on the data as well. I don't really want to risk any problems with the other sites.

If you can think of any more optimization tips I'm all ears, however it will be a couple days before I can let you guys know my findings.

Posted: Fri May 04, 2007 6:29 pm
by dreamscape
If you think you are going to be able to efficiently analyze and mine data in an online realtime process, think again.

You will be much better off generating report data in an offline process and then using that data to build your online reports.

Since it takes about five minutes to run, updating the report data every hour or so would be doable.
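
As a rough sketch of that approach (the summary table and its name, `click_view_summary`, are made up for illustration), the expensive aggregation could be run from cron and the site could read the small pre-aggregated table instead:

Code:

-- Hypothetical summary table holding pre-aggregated counts
CREATE TABLE `click_view_summary` (
  `site_id` tinyint(4) NOT NULL,
  `view_type` enum('v','c') NOT NULL,
  `clickviews` int(11) NOT NULL,
  PRIMARY KEY (`site_id`, `view_type`)
);

-- Run offline (e.g. hourly from cron): rebuild the summary
REPLACE INTO `click_view_summary` (`site_id`, `view_type`, `clickviews`)
SELECT `site_id`, `view_type`, COUNT(*)
FROM `click_views`
WHERE `af_id` > 0 AND `view_type` != ''
GROUP BY `site_id`, `view_type`;

-- Online reports then join against the tiny summary table
SELECT `s`.`clickviews`, `s`.`view_type`, `sites`.`id`, `site_name`
FROM `click_view_summary` AS `s`
INNER JOIN `sites` ON `s`.`site_id` = `sites`.`id`;

The online query then touches a handful of rows instead of 30 million, at the cost of the counts being up to an hour stale.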