Vlad, those tools aren't the same thing as what I'm doing because this stores data "in a cube", like OLAP (but with EAV).
We did a test run on marinas.com for 4 days. A million events were logged (250k per day), and this is off season, and we didn't even include everything we will eventually be logging.
I've been working day and night these last few days to make adjustments. In particular:
- you can pass a parameter to the TimeInterval report objects telling them not to "auto compact" (pulling up a report for a month for the first time would no longer have to traverse all the days and hours of that month, it would just use a "smart MySQL query")
- wrote a Compactor class that finds the earliest and latest timestamps (the delta) for traffic yet to be compacted, then returns a collection of TimeInterval_Hour and TimeInterval_Day objects covering that range, and iterates over those and compacts them (this will run via a cron script). So many corner cases there: enumerating hours between two time points that lie within the same day, lie on different days, span multiple months, or span multiple years; enumerating days that fall in the same month, span multiple months, etc. You get the point.
- the compacting algorithm underwent changes to make it more efficient. It was previously iterating every possible combination (power set) of attribute & event type to check for traffic. Now it does a little bit of work up front through MySQL to figure out which attributes [values] actually have any effect during a given time interval for a given event type, and only iterates the power set for those.
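To make the last two points a bit more concrete, here is a rough Python sketch (not the actual code from the project). It shows one way to enumerate the hours and days touched by a time range, and how limiting the power set to attributes that actually saw traffic shrinks the work. Every name here (hours_between, days_between, combinations_to_compact, the attribute names) is made up for illustration, a plain set stands in for the result of the up-front MySQL query, and skipping the empty subset is my assumption.

```python
from datetime import datetime, timedelta
from itertools import chain, combinations


def hours_between(start, end):
    """Yield the start of every clock hour touched by [start, end]."""
    current = start.replace(minute=0, second=0, microsecond=0)
    while current <= end:
        yield current
        current += timedelta(hours=1)


def days_between(start, end):
    """Yield every calendar date touched by [start, end]."""
    current = start.date()
    while current <= end.date():
        yield current
        current += timedelta(days=1)


def power_set(items):
    """Return an iterator over every non-empty subset of items, as tuples."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, size) for size in range(1, len(items) + 1)
    )


def combinations_to_compact(all_attributes, active_attributes):
    """Build the power set of only the attributes that saw traffic.

    active_attributes is a hypothetical stand-in for whatever the
    up-front database query reports for one interval and event type.
    """
    relevant = [a for a in all_attributes if a in active_attributes]
    return list(power_set(relevant))
```

Because the date arithmetic is left to datetime and timedelta, the same loop covers ranges inside one day, across midnight, across month ends (including Feb 29) and across New Year. On the power set side, four attributes give 15 non-empty combinations, but if only two of them had traffic in the interval, only 3 need to be checked.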
In plain English I made this change:
On my slow, old Windows machine it's now chunking through an hour's worth of data every 10 minutes. So I'm crossing my fingers that all this ugly performance tuning is behind me now. I still have quite a few incomplete tests I need to finish writing that cover some corner cases (needs to be done for when we start implementing pruning). Another corner case I need to test would be triggered only if someone forgot to run the cron for over 2 years; as you can guess, there are other fish to fry first.
Also, Google Code went down today for anyone using SVN, which was kind of annoying. See the latest revisions here:
http://code.google.com/p/socks/source/list