
Query optimization / indexes (indices)

Posted: Sun Jul 26, 2009 8:54 am
by wispa
I'm hoping someone might be able to help me. I'm pretty new to using indexes on databases, and I'm now having to work with a pretty large database (40 GB), so I'm being thrown in at the deep end.

The data table currently holds 870,969,236 records and this is an example query I'm running:

Code: Select all

SELECT 
    data.*, 
    DATE_FORMAT(data.reading_datetime, '%Y-%m') AS chart_date, 
    units_of_measurement.unit AS unitname, 
    dataset.name, dataset.parameter, 
    AVG(data.value) AS average, 
    STD(data.value) AS standard_deviation, 
    COUNT(data.value) AS num_of_results 
FROM 
    data 
LEFT JOIN 
    dataset ON data.dataset_id = dataset.id 
LEFT JOIN 
    units_of_measurement ON dataset.uom_id = units_of_measurement.id 
WHERE 
    reading_datetime > '2000-07-16' AND 
    reading_datetime < '2009-07-25' AND 
    dataset_id IN (4, 8, 13, 24)
GROUP BY 
    data.longitude, 
    data.latitude, 
    data.dataset_id, 
    DATE_FORMAT(data.reading_datetime, '%Y'), 
    DATE_FORMAT(data.reading_datetime, '%m') 
ORDER BY 
    data.reading_datetime ASC


I've added one index to the data table on (dataset_id, reading_datetime), which has reduced the rows examined from 870,969,236 to 90,060,802, but I'd like to get this down further if possible.

I have also tried adding an index matching the GROUP BY columns, (longitude, latitude, dataset_id, reading_datetime), but this didn't reduce the number of rows examined at all.

If anyone has any suggestions for ways I could optimize the table, I'd love to hear them.

Thanks in advance

Re: Query optimization / indexes (indices)

Posted: Tue Aug 04, 2009 1:06 pm
by tr0gd0rr
I'm no expert, but it seems like the query needs every data.value and data.reading_datetime regardless: when calculating the average and standard deviation, it's probably going to have to read all of the values anyway.

With that much data, you should probably calculate the average and standard deviation as you go along: for example, every time you insert a data point, update an auxiliary summary table.

Say you have 5 valid values for data.value:

Code: Select all

value | count
1     |  1200000
2     | 30000000
3     | 38000000
4     | 28000000
5     | 18000000 
Then you can work out the average and standard deviation in PHP after querying this small result set.
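The arithmetic for that is just a weighted mean and variance over the histogram rows. A minimal sketch (in Python for illustration; the same few lines work in PHP), using the example counts above:

```python
# Mean and population standard deviation from a (value, count) histogram,
# using the example rows above. The full data table is never touched.
histogram = {1: 1_200_000, 2: 30_000_000, 3: 38_000_000,
             4: 28_000_000, 5: 18_000_000}

n = sum(histogram.values())
mean = sum(value * count for value, count in histogram.items()) / n
# Population variance via E[X^2] - (E[X])^2
mean_sq = sum(value * value * count for value, count in histogram.items()) / n
stddev = (mean_sq - mean * mean) ** 0.5
print(n, mean, stddev)
```

Note that `E[X^2] - (E[X])^2` can lose precision if the values are large and the variance is small; for readings in a modest range like this it's fine.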

If you have a prohibitively large range for data.value, you could instead keep a one-row summary table like this, updated each time a data record is inserted:

Code: Select all

count     | avg          | stddev
880000000 | 42.12345678  | 5.12345678 
I'm not sure how to calculate stddev in either of those cases, but I think it is possible.
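It is possible: the standard approach for keeping a running mean and standard deviation without re-reading old rows is Welford's online algorithm, which stores only the count, the mean, and the sum of squared deviations (M2). A sketch in Python (in practice these three numbers would be columns in the summary table, updated on each insert):

```python
# Welford's online algorithm: update count, mean, and M2 (sum of squared
# deviations from the mean) incrementally, one reading at a time.
import math

def update(state, new_value):
    """Fold one new reading into the running statistics."""
    state["count"] += 1
    delta = new_value - state["mean"]
    state["mean"] += delta / state["count"]
    # delta2 uses the *updated* mean; delta * delta2 keeps M2 exact
    delta2 = new_value - state["mean"]
    state["m2"] += delta * delta2
    return state

def stddev(state):
    """Population standard deviation from the running state."""
    return math.sqrt(state["m2"] / state["count"]) if state["count"] else 0.0

state = {"count": 0, "mean": 0.0, "m2": 0.0}
for reading in [1, 2, 3, 4, 5]:
    update(state, reading)
print(state["count"], state["mean"], stddev(state))  # 5 3.0 1.414...
```

This is numerically stable even over hundreds of millions of inserts, unlike keeping running sums of x and x² directly.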

As far as time efficiency of inserts into `data` goes, it sounds like you almost always write to `data` but rarely read. I've heard the rule of thumb is that each index multiplies the insert time by roughly three. So if an insert takes 0.01 s with no indexes, with two indexes it might take around 0.09 s.

Anyway, my point is that updating one of these small summary tables on each insert into `data` would probably take about as long as inserting records into `data` with 2 or 3 indexes.

Re: Query optimization / indexes (indices)

Posted: Tue Aug 04, 2009 3:06 pm
by Eran
Please run EXPLAIN on that query and post the results here. Also post the structure of your tables.