Re: Statistics Framework
Posted: Thu Feb 18, 2010 6:15 pm
by Christopher
I agree that it is a kind of partitioning. But it does not fit any of the standard descriptions of sharding or partitioning I have heard. Those divide the data in several different ways, but none of them split it into recent records versus everything older. That's why I don't think it is 'exactly' partitioning, and it is definitely not sharding.
Re: Statistics Framework
Posted: Thu Feb 18, 2010 6:25 pm
by josh
Although you are partitioning the data (as any English major would tell you), it is not a partition in RDBMS terminology.
I would consider it an archive table: it's just a table you move data into after a certain amount of time. Partitions imply more complexity, IMO.
Also, here's an excerpt from The Data Warehouse Toolkit to hopefully get the discussion back on topic:
Periodic snapshot fact table - A type of fact table that represents business performance at the end of each regular, predictable time period. Daily snapshots and monthly snapshots are common. Snapshots are required in a number of businesses, such as insurance, where the transaction history is too complicated to be used as the basis for computing snapshots on the fly. A separate record is placed in a periodic snapshot table each period regardless of whether any activity has taken place in the underlying account[1]. Contrast with transaction fact table[2] and accumulating snapshot table[3].
[1] Still deciding on this personally for my framework. If I do this I could suffer from sparse matrices. If I don't, I could have a ton of querying just to discover that there were 0 hits for a given segment. Right now I don't record zero values. I intend to use Zend_Cache as plan A for dealing with this "anomaly".
[2] I guess this would be the case if they just inserted one row per event and relied on the database for statistical analysis.
[3] This would be like having a single `views` column that gets incremented every so often (we know it had x hits during the year, but we can't be any more specific than that, sorry).
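To make the distinction in the excerpt concrete, here is a small sketch in Python (illustrative names only; the actual framework is PHP). It contrasts a true periodic snapshot, which stores a row for every account each period even with zero activity, against the sparse "don't record zero values" variant described in [1].

```python
from collections import Counter

def periodic_snapshot(events, accounts, period):
    """One row per account per period, even when the count is zero
    (Kimball's periodic snapshot). `events` is a list of (period, account)."""
    counts = Counter(events)
    return {(period, a): counts.get((period, a), 0) for a in accounts}

def sparse_snapshot(events, period):
    """Only non-zero rows are stored (the 'don't record zero values' variant)."""
    return {k: v for k, v in Counter(events).items() if k[0] == period}

events = [("2010-02", "acct1"), ("2010-02", "acct1"), ("2010-02", "acct2")]
full = periodic_snapshot(events, ["acct1", "acct2", "acct3"], "2010-02")
sparse = sparse_snapshot(events, "2010-02")
# full carries an explicit zero row for acct3; sparse omits it entirely
```

The trade-off is exactly the one in footnote [1]: the full snapshot wastes rows on zeros, while the sparse one cannot distinguish "zero hits" from "not yet analyzed" without extra bookkeeping.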
Re: Statistics Framework
Posted: Thu Feb 18, 2010 6:35 pm
by josh
Plan B is to have a "meta log": a log of which time periods have already been analyzed. If there is an entry for a time period, we can safely assume that any query within that range which finds no stored rows is a true zero.
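A minimal sketch of that meta-log idea, in Python for illustration (hypothetical function and structure names, not the framework's actual API):

```python
def hits_for(meta_log, stats, period, query):
    """If the period has been analyzed and no row matched, the answer is a
    guaranteed zero; no need to touch the fact table at all.
    Returns None when the period has not been analyzed yet."""
    if period in meta_log and (period, query) not in stats:
        return 0  # safe: the analyzer already covered this period
    return stats.get((period, query))

meta_log = {"2010-02-18 18:00"}   # periods the analyzer has already covered
stats = {}                        # compacted non-zero counts
known_zero = hits_for(meta_log, stats, "2010-02-18 18:00", "views")
unknown = hits_for(meta_log, stats, "2010-02-19 09:00", "views")
# known_zero is 0; unknown is None (not analyzed, so nothing can be assumed)
```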
Re: Statistics Framework
Posted: Fri Feb 19, 2010 3:15 am
by VladSun
Or you could use an
RRD-like DB design

It is very applicable if you have a fixed number of values to log.
Re: Statistics Framework
Posted: Fri Feb 19, 2010 7:14 am
by josh
VladSun wrote:Or you could use an
RRD-like DB design

It is very applicable if you have a fixed number of values to log.
That article says that once it has logged enough data, new entries overwrite the old ones? My solution never needs to prune data.

The idea is to be able to look back 10 years through billions of hits and get accuracy down to the hour. Interesting link, though.
PS: I decided to make a "meta" table to avoid storing records where there are 0 values.
Re: Statistics Framework
Posted: Fri Feb 19, 2010 7:50 am
by VladSun
josh wrote:That article says that once it has logged enough data, new entries overwrite the old ones? My solution never needs to prune data.

The idea is to be able to look back 10 years through billions of hits and get accuracy down to the hour.
josh wrote:You might remember a childhood memory but you don't remember every detail about it
RRD is the closest to what you said

You'll have to define several RRDs - take a look at the MRTG demo:
http://www.switch.ch/network/operation/ ... eant2.html (it uses RRD).
It has daily graphs with 5-minute accuracy, weekly graphs with 2-hour accuracy, etc.
Of course, one can always define a 10-year (or 100-year) period - so data is not simply pruned, it is saved at a lower accuracy.
The longer the period, the lower the accuracy you get - much like biological memory works.
PS:

Your childhood memory example is not a good one - many people remember far more details about events in their childhood than about events that took place, say, a week ago.

Re: Statistics Framework
Posted: Fri Mar 19, 2010 8:22 pm
by josh
Vlad, those tools aren't the same as what I'm doing, because mine stores data "in a cube", like OLAP (but with EAV).
We did a test run on marinas.com for 4 days. A million events were logged, 250k per day, and this is the off-season; we didn't even include all the things we will eventually be logging.
I've been working day and night these last few days to make adjustments. In particular:
- you can pass a parameter to the TimeInterval report objects telling them not to "auto compact" (pulling up a report for a month for the first time no longer has to traverse all the days and hours of that month; it just uses a "smart MySQL query")
- wrote a Compactor class that finds the earliest and latest (delta) traffic yet to be compacted, returns a collection of TimeInterval_Hour and TimeInterval_Day objects covering those time points, and iterates over and compacts them (this will run via a cron script). So many corner cases there: enumerating hours between two time points that lie within the same day or on different days, spans crossing multiple months or multiple years, enumerating days that occur in the same month or span multiple months, etc. You get the point.
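Most of those enumeration corner cases disappear if the loop leans on plain time arithmetic rather than per-day/per-month logic. A hypothetical sketch in Python (the real Compactor and TimeInterval classes are PHP and not shown here):

```python
from datetime import datetime, timedelta

def hours_between(start, end):
    """Yield each hour boundary touched by [start, end). Spans crossing
    days, months, and years are handled uniformly because timedelta
    arithmetic absorbs the calendar corner cases."""
    t = start.replace(minute=0, second=0, microsecond=0)
    while t < end:
        yield t
        t += timedelta(hours=1)

# a span that crosses a month boundary (Feb 28 -> Mar 1, 2010)
span = list(hours_between(datetime(2010, 2, 28, 22, 30),
                          datetime(2010, 3, 1, 1, 0)))
# yields 22:00, 23:00, and 00:00 of the next month
```

The same pattern with `timedelta(days=1)` handles day enumeration across month and year boundaries.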
- the compacting algorithm underwent changes to make it more efficient. It was previously iterating every possible combination (power set) of attribute and event type to check for traffic. Now it does a little bit of work upfront through MySQL to figure out which attribute values actually have any effect during a given time interval for a given event type, and only iterates the power set of those.
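That optimization can be sketched as follows (Python, illustrative names only): restrict the power set to attributes that actually occur in the interval, rather than iterating the full attribute universe.

```python
from itertools import combinations

def power_set(items):
    """All subsets of `items`, smallest first."""
    for r in range(len(items) + 1):
        yield from combinations(items, r)

def active_attributes(rows):
    """Upfront pass (the 'little bit of work through MySQL' would be
    something like a SELECT DISTINCT): which attributes actually occur
    in this interval for this event type?"""
    seen = set()
    for attrs in rows:
        seen.update(attrs)
    return sorted(seen)

rows = [{"a"}, {"a", "b"}, {"a"}]  # one hour of traffic, one event type
subsets = list(power_set(active_attributes(rows)))
# 2**2 = 4 combinations to check, instead of 2**N for the full universe
```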
In plain English: on my slow, old Windows machine it's now chunking through an hour's worth of data every 10 minutes. So I'm crossing my fingers that all this ugly performance tuning is behind me now. I still have quite a few incomplete tests to finish writing that cover corner cases (that needs to be done before we start implementing pruning). Another corner case I need to test would only be triggered if someone forgot to run the cron for over two years; as you can guess, there are other fish to fry first.
Also, Google Code went down today for anyone using SVN, which was kind of annoying. See the latest revisions here:
http://code.google.com/p/socks/source/list
Re: Statistics Framework
Posted: Sun Mar 21, 2010 4:47 am
by josh
I found this on Wikipedia; it kind of explains my observations and maybe better communicates why I've been making the adjustments I have.
Posting it here because others may find it interesting, and also for my own future reference.
Linking cubes and sparsity
The commercial OLAP products have different methods of creating the cubes and hypercubes and of linking cubes and hypercubes (see Types of OLAP in the article on OLAP.)
Linking cubes is a method of overcoming sparsity. Sparsity arises when not every cell in the cube is filled with data, and so valuable processing time is taken by effectively adding up zeros. For example, revenues may be available for each customer and product, but cost data may not be available at this level of analysis. Instead of creating a sparse cube, it is sometimes better to create a separate but linked cube in which a subset of the data can be analyzed in great detail. The linking ensures that the data in the cubes remains consistent.
I'd welcome any advice or critique!
From the sources on the wiki page I learned another couple of terms that apply to me: "cross tabulation", or "pivot table".
It's also entirely coincidental, but interesting, that the example crosses over with automotive, which is my other line of work.
It's also interesting that collecting this data, and being able to pull it up in milliseconds, opens up lots of doors in terms of statistical analysis (read up on the "chi-squared test" or "Bayesian probability", for example).
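As a toy illustration of that kind of analysis (Python, hand-rolled rather than a stats library): Pearson's chi-squared statistic computed over a small pivot table, the sort of cross tabulation the compacted data could feed.

```python
def chi_square(observed):
    """Pearson's chi-squared statistic for a 2D contingency (pivot) table:
    sum over cells of (observed - expected)^2 / expected, where expected
    assumes the row and column variables are independent."""
    row_tot = [sum(r) for r in observed]
    col_tot = [sum(c) for c in zip(*observed)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / total
            stat += (o - e) ** 2 / e
    return stat

# e.g. hits broken down by (segment x event type), pulled from the cube
table = [[30, 10],
         [20, 40]]
stat = chi_square(table)
# stat ~= 16.67; compare against a chi-squared critical value for df = 1
```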
Re: Statistics Framework
Posted: Mon Mar 22, 2010 10:02 pm
by josh
I was able to avoid hundreds of thousands of "0 result queries" (sparsity) with another refactoring: overloading the strategy for the Day object to re-use some of the work that was done when compacting the Hour.
Sample data:
1 million events logged over 4 days.
Test machine is a 4-year-old spare server with only 1 GB of RAM.
Before:
- Compacting an Hour interval took 2 minutes
- Compacting a Day interval took 17 HOURS
After:
- Compacting an Hour interval took 2 minutes
- Compacting a Day interval took 1 minute
http://code.google.com/p/socks/source/detail?r=225#
Got the idea from one of the articles in that wiki.
So now the only place permutations or power sets have to be computed is when compacting the hour interval. Because it is constrained to an hour's worth of data at a time, for one event type at a time, the number of possible combinations will hopefully tend to be small.
Once that is done, the SQL engine can do the heavy lifting of "collapsing" the data, rolling it up into the daily summaries. I'm going to assume the strategy for going from day to month will be the same.
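The roll-up step can be illustrated with SQLite standing in for MySQL (illustrative schema and table names, not the project's actual ones): once the hourly rows exist, a GROUP BY collapses them into daily summaries with no per-combination iteration.

```python
import sqlite3

# Hourly summaries already produced by the hour-level compaction pass.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE hour_stats (day TEXT, hour INTEGER, event TEXT, hits INTEGER)"
)
db.executemany("INSERT INTO hour_stats VALUES (?, ?, ?, ?)", [
    ("2010-03-20", 9,  "view", 10),
    ("2010-03-20", 10, "view", 5),
    ("2010-03-21", 9,  "view", 7),
])

# Rolling hours up to days: the SQL engine does the 'collapsing'.
daily = db.execute(
    "SELECT day, event, SUM(hits) FROM hour_stats "
    "GROUP BY day, event ORDER BY day"
).fetchall()
# [('2010-03-20', 'view', 15), ('2010-03-21', 'view', 7)]
```

The same query shape, grouping day rows by month, would cover the day-to-month roll-up mentioned above.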
I think it's pretty interesting that I just avoided the whole sparsity problem with a flick of the wrist, but I'm hoping that writing this won't jinx me.
Re: Statistics Framework
Posted: Tue Jul 06, 2010 2:42 pm
by josh
I've launched my first implementation with this. At times I thought I was trying to do something impossible and wanted to give up... but I kept going, and kept thinking of ways to make it faster, like using a serialized string to store the attributes instead of the clumsy EAV approach.
On a quad Xeon with a nice amount of RAM, we have logged some 30 million odd "actions" in the last few months. The reporting application is able to generate reports in under 10 seconds on average, no matter how much we log. These benchmarks are the same whether I want all traffic or just traffic with certain "tags" or attributes (that was part of the design from the beginning).
For example, to generate this report the query only had to look at 4 rows, not 14.5 million.
The source tree has been re-factored & has evolved since the last time I updated:
http://code.google.com/p/socks/source/b ... y/PhpStats (files that end in Test.php are unit test files)
For example, if we log an action with the attribute "a" set to "1" and "b" set to "2", the algorithm generates the string `1;2;` for indexing purposes. That way I don't need to join tables that are 100 million rows large in order to constrain on some attribute(s).
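A sketch of that serialization idea in Python (hypothetical helper; the real code is PHP): fixing each attribute's position in the string makes it usable as a single indexed column, replacing joins against a huge EAV table.

```python
def attribute_index(attrs, universe):
    """Serialize attribute values into one indexable string ('1;2;').
    `universe` fixes the position of every attribute, so a missing
    attribute leaves an empty slot and equal strings mean equal tuples."""
    return "".join(f"{attrs.get(name, '')};" for name in universe)

key = attribute_index({"a": 1, "b": 2}, universe=["a", "b"])
# key == "1;2;" -- store it alongside the fact row, index it, and
# constrain with WHERE attr_key = '1;2;' instead of EAV self-joins
```

The positional scheme means a prefix match can also constrain on leading attributes only, at the cost of having to rebuild keys if the attribute universe changes.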
Re: Statistics Framework
Posted: Tue Jul 06, 2010 4:49 pm
by Jonah Bron
So is this the stuff you learn when you get a Computer Science degree?
Re: Statistics Framework
Posted: Tue Jul 06, 2010 5:09 pm
by Eran
Actually, no. This is the stuff you learn when you deal with real-world concerns.
Re: Statistics Framework
Posted: Tue Jul 06, 2010 5:11 pm
by John Cartwright
Jonah Bron wrote:
So is this the stuff you learn when you get a Computer Science degree?
//derailment
You would be surprised how little some of the CS programs teach you
//end derailment
Re: Statistics Framework
Posted: Tue Jul 06, 2010 5:39 pm
by Jonah Bron
Interesting.
</offtopic>
Re: Statistics Framework
Posted: Tue Jul 06, 2010 9:38 pm
by josh
Actually, I had no schooling in programming. This is what you get when you set your mind to your goals.
