[SOLVED] Slow query, Large dataset

Questions about MySQL, PostgreSQL, and most other databases, as well as using them with PHP, can be asked here.

Moderator: General Moderators

Post Reply
User avatar
Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

[SOLVED] Slow query, Large dataset

Post by Benjamin »

I've set aside a dedicated Linux server to run this query: a 64-bit dual-core AMD with 4 GB of RAM. MySQL is configured to use as much RAM as it needs.

I have a large set of data that I need to calculate similarity scores on. My goal is to have all of the processing done within 2 months.

The problem is that I have not been able to get a query to run fast enough. The query below is taking way too long: it takes over 10 hours, and I need to run it a few thousand times. MySQL CPU usage is pegged at 100%, which is good because it means it's not hitting the disk often.

My goal is to get it down to 1 hour or less.


Here is a stripped-down version of the query, which uses subselects. It's been running for 34,730 seconds so far.

Code: Select all

 
UPDATE
  data_pairs dp
SET
  dp.score_all_similarity
  = dp.score_all_similarity + (1 / (1 + SQRT(POW(
      (SELECT ((dp1.int_1 / dp1.int_2) * 100) FROM data_store dp1
        WHERE dp1.object_id = dp.object_id_1 AND dp1.date = '{$date['date']}' LIMIT 1)
      - (SELECT ((dp2.int_1 / dp2.int_2) * 100) FROM data_store dp2
        WHERE dp2.object_id = dp.object_id_2 AND dp2.date = '{$date['date']}' LIMIT 1),
      2)))),
  dp.score_all_weight = dp.score_all_weight + 1;
 
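For anyone following along, the arithmetic inside that SET clause reduces to 1 / (1 + |a - b|), where a and b are the two (int_1 / int_2) * 100 ratios pulled from data_store. A quick sketch in Python (the function name is mine, not from the schema):

```python
def similarity(a: float, b: float) -> float:
    """Mirror of the SQL expression 1 / (1 + SQRT(POW(a - b, 2))).

    SQRT(POW(x, 2)) is just ABS(x), so this is 1 / (1 + |a - b|).
    """
    return 1 / (1 + abs(a - b))

# Identical ratios contribute the maximum score of 1.0;
# the further apart they are, the closer the score gets to 0.
print(similarity(50.0, 50.0))  # 1.0
print(similarity(75.0, 25.0))  # 1/51, roughly 0.0196
```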
Here is a version of the query that doesn't use subselects. I let it run for about two or three hours.

Code: Select all

 
UPDATE
  data_pairs dp,
  data_store dp1,
  data_store dp2
SET
  dp.score_all_similarity
    = dp.score_all_similarity + 1 / (1 + SQRT(POW(((dp1.int_1 / dp1.int_2) * 100) - ((dp2.int_1 / dp2.int_2) * 100), 2))),
  dp.score_all_weight = dp.score_all_weight + 1
WHERE
  dp1.object_id = dp.object_id_1
  AND dp2.object_id = dp.object_id_2
  AND dp1.date = '{$date['date']}'
  AND dp2.date = dp1.date;
 
There are 5.9 million records in the data_pairs table with an index on the primary key.
There are 8.8 million records in the data_store table with indexes on object_id and date.

This query updates two fields in every row of the data_pairs table by pulling two records from the data_store table.

Running the subquery by itself takes 0.0024 seconds.

Any ideas here?
Last edited by Benjamin on Sat Jul 19, 2008 3:52 pm, edited 1 time in total.
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: Slow query, Large dataset

Post by dml »

Can you convert the UPDATE into a SELECT and run an EXPLAIN on it?

Code: Select all

 
EXPLAIN
SELECT * FROM
  data_pairs dp,
  data_store dp1,
  data_store dp2
WHERE
  dp1.object_id = dp.object_id_1
  AND dp2.object_id = dp.object_id_2
  AND dp1.date = '{$date['date']}'
  AND dp2.date = dp1.date;
 
User avatar
Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

Re: Slow query, Large dataset

Post by Benjamin »

Here are the results of explain.

Code: Select all

 
id select_type table type possible_keys  key       key_len ref                       rows    Extra
1  SIMPLE      sp    ALL  NULL           NULL      NULL    NULL                      5970240      
1  SIMPLE      sp2   ref  object_id,date object_id 2       data_store.sp.object_id_2 2551    Using where
1  SIMPLE      sp1   ref  object_id,date object_id 2       data_store.sp.object_id_1 2551    Using where
 
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: Slow query, Large dataset

Post by dml »

OK, so (using PHP-ish pseudocode) MySQL is doing:

Code: Select all

 
foreach($data_pairs_table as $data_pair){ // around 5 million iterations
  $list1 = $objects_table->index_lookup("id=".$data_pair->obj1id);
  $list2 = $objects_table->index_lookup("id=".$data_pair->obj2id);
  foreach($list1 as $obj1){ // around 2500 iterations
    foreach($list2 as $obj2){ // around 2500 iterations
      if(the dates and ids match){
        // output a row (or update the similarity score for this combination)
      }
    }
  }
}
 
Given your knowledge of the data, is that a sensible way of doing it, or would you have approached it another way? There's a filter on date that looks like it might cut down the amount of data to be processed, or would it?

Around how many entries in data_store per object_id per date? If it's a small number, then an index on (object_id, date) might cut down the inner loop iterations to something a lot less than 2500*2500 = 6.25 million per data pair.
User avatar
Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

Re: Slow query, Large dataset

Post by Benjamin »

Close to that. Each time the query runs it should update all 5,970,240 records in the data_pairs table.

In php it would look like:

Code: Select all

 
foreach($data_pairs_table as $data_pair){ // around 5 million iterations
  // execute two queries
  $list1 = $objects_table->index_lookup("id=".$data_pair->obj1id);
  $list2 = $objects_table->index_lookup("id=".$data_pair->obj2id);
 
  // if a record doesn't exist for either query skip it
  if (!$list1 || !$list2) continue;
 
  // do some math
 
  // update the table
  $db->query("update data_pairs set value1 = 'x', value2 = value2 + 1 where pair_id = n");
}
 
Date doesn't reduce the number of records in the data_pairs table, but it does in the data_store table.

I may be able to do it with PHP, but I figured MySQL would be faster.
dml wrote:Around how many entries in data_store per object_id per date? If it's a small number, then an index on (object_id, date) might cut down the inner loop iterations to something a lot less than 2500*2500 = 6.25 million per data pair.
There shouldn't be any inner loop iterations. It only needs to iterate over the data_pairs table, pulling results from the data_store as it progresses.
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: Slow query, Large dataset

Post by dml »

Indeed, I wouldn't recommend doing it in PHP! The PHP code I wrote is a rough equivalent of what MySQL reports it's doing in the EXPLAIN output: it iterates through the 5 million data pairs, pulls 2500 records from the data store for each element of the pair, and then iterates through each combination of those records to find the matching ones (it might be doing some prefiltering, but worst case that's 6.25 million combinations to check per data pair).

From looking at the version of your code with the subselects, would it be correct to understand that you expect to pull no more than two records from data_store per data pair? That there's no more than one data_store entry per object_id per day? If that's the case, then a unique index on (object_id, date) would speed things up.
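For the record, the index being suggested would look something like this (assuming the table and column names from the queries above; note that a UNIQUE index will refuse to build if duplicate (object_id, date) rows already exist, so check for those first):

```sql
-- Unique composite index so each (object_id, date) lookup resolves
-- to at most one row, instead of scanning ~2500 candidates per object_id.
ALTER TABLE data_store
  ADD UNIQUE INDEX idx_object_date (object_id, date);
```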
User avatar
Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

Re: Slow query, Large dataset

Post by Benjamin »

I owe you one, man. Thanks! This is much better than I expected.

Code: Select all

 
Query OK, 5592840 rows affected (1 min 33.50 sec)
Rows matched: 5592840  Changed: 5592840  Warnings: 0
 
Post Reply