I have a large set of data that I need to calculate similarity scores on. My goal is to have all of the processing done within 2 months.
The problem is that I have not been able to get a query to run fast enough. I have a query below that is taking way too long to run. It takes over 10 hours and I need to run it a few thousand times. MySQL CPU usage is pegged at 100% which is good because it means it's not hitting the disk often.
My goal is to get it down to 1 hour or less.
Here is a stripped down version of the query which is using subselects. It's been running for 34730 seconds so far.
Code: Select all
UPDATE
data_pairs dp
SET
dp.score_all_similarity
= dp.score_all_similarity + (1 / (1 + SQRT(POW((SELECT ((dp1.int_1 / sp1.int_2) * 100) FROM data_store dp1 WHERE dp1.object_id = dp.object_id_1 AND dp1.date = '{$date['date']}' LIMIT 1) - (SELECT ((dp2.int_1 / dp2.int_2) * 100) FROM data_store dp2 WHERE dp2.object_id = dp.object_id_2 AND dp2.date = '{$date['date']}' LIMIT 1), 2)))),
dp.score_all_weight = dp.score_all_weight + 1;
Code: Select all
UPDATE
data_pairs dp,
data_store dp1,
data_store dp2
SET
dp.score_all_similarity
= dp.score_all_similarity + 1 / (1 + SQRT(POW(((dp1.int_1 / dp1.int_2) * 100) - ((dp2.int_1 / dp2.int_2) * 100), 2))),
dp.score_all_weight = dp.score_all_weight + 1
WHERE
dp1.object_id = dp.object_id_1
AND dp2.object_id = dp.object_id_2
AND dp1.date = '{$date['date']}'
AND dp2.date = dp1.date;
There are 8.8 million records in the data_store table with indexes on object_id and date.
This query updates two fields in every row of the data_pairs table by pulling two records from the data_store table.
Running the subquery by itself takes 0.0024 seconds.
Any ideas here?