[SOLVED] Slow query, Large dataset
Posted: Sat Jul 19, 2008 12:41 pm
I've set aside a dedicated Linux server to run this query: an AMD 64-bit dual-core with 4 GB of RAM. MySQL is configured to use as much RAM as it needs.
I have a large set of data that I need to calculate similarity scores on. My goal is to have all of the processing done within 2 months.
The problem is that I have not been able to get the query to run fast enough. The query below takes over 10 hours, and I need to run it a few thousand times. MySQL CPU usage is pegged at 100%, which is good: it means the server isn't hitting the disk often.
My goal is to get it down to 1 hour or less.
The first query below is a stripped-down version that uses subselects. It has been running for 34,730 seconds (about 9.6 hours) so far.
The second query below doesn't use subselects. I let it run for about two or three hours.
There are 5.9 million records in the data_pairs table with an index on the primary key.
There are 8.8 million records in the data_store table with indexes on object_id and date.
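For reference, here is roughly what the two tables look like. The column types, and any names beyond those that appear in the queries, are my assumptions:

```sql
-- Sketch of the schema as described; column types are assumed.
CREATE TABLE data_pairs (
  id                   INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- ~5.9 million rows
  object_id_1          INT UNSIGNED NOT NULL,
  object_id_2          INT UNSIGNED NOT NULL,
  score_all_similarity DOUBLE       NOT NULL DEFAULT 0,
  score_all_weight     INT UNSIGNED NOT NULL DEFAULT 0,
  PRIMARY KEY (id)
);

CREATE TABLE data_store (
  object_id INT UNSIGNED NOT NULL,  -- ~8.8 million rows
  date      DATE         NOT NULL,
  int_1     INT          NOT NULL,
  int_2     INT          NOT NULL,
  KEY idx_object_id (object_id),
  KEY idx_date (date)
);
```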
This query updates two fields in every row of the data_pairs table by pulling two records from the data_store table.
Running the subquery by itself takes 0.0024 seconds.
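By "the subquery by itself" I mean one of the inner SELECTs with literal values plugged in, something like this (the object_id and date values here are placeholders):

```sql
SELECT ((ds.int_1 / ds.int_2) * 100)
FROM data_store ds
WHERE ds.object_id = 12345       -- placeholder object_id
  AND ds.date = '2008-01-01'     -- placeholder date
LIMIT 1;
```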
Any ideas here?
Code:
UPDATE
    data_pairs dp
SET
    dp.score_all_similarity = dp.score_all_similarity
        + (1 / (1 + SQRT(POW(
              (SELECT ((dp1.int_1 / dp1.int_2) * 100)
                 FROM data_store dp1
                WHERE dp1.object_id = dp.object_id_1
                  AND dp1.date = '{$date['date']}'
                LIMIT 1)
            - (SELECT ((dp2.int_1 / dp2.int_2) * 100)
                 FROM data_store dp2
                WHERE dp2.object_id = dp.object_id_2
                  AND dp2.date = '{$date['date']}'
                LIMIT 1), 2)))),
    dp.score_all_weight = dp.score_all_weight + 1;
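Incidentally, SQRT(POW(x, 2)) is just ABS(x), so the similarity expression can be written more simply without changing its result. A sketch, where pct_1 and pct_2 stand for the two subselect results above:

```sql
-- 1 / (1 + SQRT(POW(pct_1 - pct_2, 2)))  is equivalent to:
--    1 / (1 + ABS(pct_1 - pct_2))
SET dp.score_all_similarity
  = dp.score_all_similarity + 1 / (1 + ABS(pct_1 - pct_2))
```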
Code:
UPDATE
data_pairs dp,
data_store dp1,
data_store dp2
SET
dp.score_all_similarity
= dp.score_all_similarity + 1 / (1 + SQRT(POW(((dp1.int_1 / dp1.int_2) * 100) - ((dp2.int_1 / dp2.int_2) * 100), 2))),
dp.score_all_weight = dp.score_all_weight + 1
WHERE
dp1.object_id = dp.object_id_1
AND dp2.object_id = dp.object_id_2
AND dp1.date = '{$date['date']}'
AND dp2.date = dp1.date;
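To see which plan MySQL actually picks, the join form can be checked with EXPLAIN against an equivalent SELECT (EXPLAIN does not accept UPDATE statements in MySQL versions of this era; the date literal is a placeholder for the interpolated value):

```sql
EXPLAIN
SELECT dp.object_id_1, dp.object_id_2,
       (dp1.int_1 / dp1.int_2) * 100 AS pct_1,
       (dp2.int_1 / dp2.int_2) * 100 AS pct_2
FROM data_pairs dp,
     data_store dp1,
     data_store dp2
WHERE dp1.object_id = dp.object_id_1
  AND dp2.object_id = dp.object_id_2
  AND dp1.date = '2008-01-01'    -- placeholder for {$date['date']}
  AND dp2.date = dp1.date;
```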