Statistics Framework

Coding Critique is the place to post source code for peer review by other members of DevNetwork. Any kind of code can be posted; it is not limited to PHP. All members are invited to contribute constructive criticism with the goal of improving the code. Posted code should include some background information and the specific areas you would like help with.

Popular code excerpts may be moved to "Code Snippets" by the moderators.

Moderator: General Moderators

josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Statistics Framework

Post by josh »

VladSun wrote:Or you could use RRD-like DB design :P
It will be very applicable if you have a fixed number of values to log.
That article says that after it has logged enough data, new entries overwrite the old ones. My solution never needs to prune data ;-) The idea is to be able to look back 10 years through billions of hits and get accuracy down to the hour. Interesting link, though.

PS: I decided to make a "meta" table to avoid storing records where there are 0 values.
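The "meta" table idea above can be sketched roughly like this. This is a hypothetical illustration, not the actual PhpStats code: only non-zero cells get a summary row, so the sparse (zero-valued) portion of the cube is never stored at all.

```php
<?php
// Hypothetical sketch: instead of inserting a summary row for every
// (attribute combination) -- most of which are zero -- keep only the
// non-zero counts. A separate "meta" table would then record which
// combinations actually occurred in the interval.
function rowsToStore(array $counts): array
{
    $rows = [];
    foreach ($counts as $combination => $hits) {
        if ($hits > 0) {                  // skip sparse, zero-valued cells
            $rows[] = ['combination' => $combination, 'hits' => $hits];
        }
    }
    return $rows;
}

$counts = ['browser=ff' => 12, 'browser=ie' => 0, 'browser=chrome' => 3];
$stored = rowsToStore($counts);
// Only the two non-zero combinations are kept.
```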
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Statistics Framework

Post by VladSun »

josh wrote:That article says that after it has logged enough data, new entries overwrite the old ones. My solution never needs to prune data ;-) The idea is to be able to look back 10 years through billions of hits and get accuracy down to the hour.
josh wrote:You might remember a child-hood memory but you don't remember every detail about it
RRD is the closest to what you said :)
You'll have to define several RRDs - take a look at the MRTG demo: http://www.switch.ch/network/operation/ ... eant2.html (it uses RRD).

It has daily graphs with 5-minute accuracy, weekly graphs with 2-hour accuracy, and so on.
Of course, one can always define a 10-year (or 100-year) period - so data is not simply pruned; it is saved at a lower accuracy.

The longer the period is, the lower the accuracy you get - much like biological memory works :)
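The round-robin idea behind RRD can be sketched in a few lines of PHP. This is an illustrative toy, not real RRDtool: each "archive" is a fixed-size ring buffer, and an RRD keeps several of them at decreasing resolution, so old data is consolidated rather than deleted outright.

```php
<?php
// Illustrative sketch of an RRD-style archive: a fixed-size ring
// buffer where the newest value overwrites the oldest slot.
class RoundRobinArchive
{
    private array $slots;
    private int $pos = 0;

    public function __construct(private int $size)
    {
        $this->slots = array_fill(0, $size, null);
    }

    public function push(float $value): void
    {
        $this->slots[$this->pos] = $value;           // overwrite oldest slot
        $this->pos = ($this->pos + 1) % $this->size;
    }

    public function values(): array
    {
        return array_filter($this->slots, fn($v) => $v !== null);
    }
}

// e.g. 288 five-minute slots cover one day; 168 two-hour slots cover a
// week -- coarser archives trade accuracy for a longer time span.
$daily  = new RoundRobinArchive(288);
$weekly = new RoundRobinArchive(168);
```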

PS: :P :P
Your childhood memory example is not a good one - many people remember far more details about events in their childhood than about events that took place, say, a week ago ;)
There are 10 types of people in this world, those who understand binary and those who don't
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Statistics Framework

Post by josh »

Vlad, those tools aren't the same as what I'm doing, because this stores data "in a cube", like OLAP (but with EAV).

We did a test run on marinas.com for 4 days. A million events were logged (250k per day), and that was in the off season, before we had even included everything we will eventually be logging.

I've been working day and night these last few days to make adjustments. In particular:

- you can now pass a parameter to the TimeInterval report objects telling them not to "auto compact" (pulling up a report for a month for the first time no longer has to traverse all the days and hours of that month; it just uses a "smart" MySQL query)

- wrote a Compactor class that finds the earliest and latest timestamps (the delta) of traffic yet to be compacted, returns a collection of TimeInterval_Hour and TimeInterval_Day objects covering those time points, then iterates over them and compacts them (this will run via a cron script)... so many corner cases there: enumerating hours between two time points that lie within the same day, on different days, spanning multiple months, spanning multiple years; enumerating days within the same month, spanning multiple months, etc. You get the point.

- the compacting algorithm underwent changes to make it more efficient. It was previously iterating every possible combination (power set) of attribute and event type to check for traffic. Now it does a little work up front through MySQL to figure out which attribute values actually have any effect during a given time interval for a given event type, and only iterates the power set for those.
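The hour-enumeration corner cases described above can be handled uniformly by walking hour boundaries with date arithmetic. This is a hypothetical helper in the spirit of the Compactor, not the actual class; it works the same whether the two points lie in the same day or span day, month, and year boundaries.

```php
<?php
// Hypothetical helper: enumerate every hour boundary between two
// points in time, regardless of day/month/year boundaries.
function hoursBetween(DateTimeImmutable $start, DateTimeImmutable $end): array
{
    $hours = [];
    // Snap the cursor down to the start of its hour.
    $cursor = $start->setTime((int)$start->format('G'), 0, 0);
    while ($cursor <= $end) {
        $hours[] = $cursor;
        $cursor = $cursor->modify('+1 hour');
    }
    return $hours;
}

$start = new DateTimeImmutable('2009-12-31 22:15:00');
$end   = new DateTimeImmutable('2010-01-01 01:05:00');
// Spans a day, month, and year boundary: 22:00, 23:00, 00:00, 01:00.
$hours = hoursBetween($start, $end);
```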

In plain English, I made this change:
[attached image]

On my slow, old Windows machine it's now chunking through an hour's worth of data every 10 minutes, so I'm crossing my fingers that all this ugly performance tuning is behind me now. I still have quite a few incomplete tests to finish writing that cover some corner cases (this needs to be done before we start implementing pruning). Another corner case I need to test would only be triggered if someone forgot to run the cron for over 2 years; as you can guess, there are other fish to fry first.

Also, Google Code went down today for anyone using SVN, which was kind of annoying. See the latest revisions here:
http://code.google.com/p/socks/source/list
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Statistics Framework

Post by josh »

I found this on Wikipedia, which kind of explains my observations and maybe better communicates why I've been making the adjustments I have.

Posting it here because others may find it interesting, and also for my own future reference.
Linking cubes and sparsity

The commercial OLAP products have different methods of creating the cubes and hypercubes and of linking cubes and hypercubes (see Types of OLAP in the article on OLAP.)

Linking cubes is a method of overcoming sparsity. Sparsity arises when not every cell in the cube is filled with data and so valuable processing time is taken by effectively adding up zeros. For example revenues may be available for each customer and product but cost data may not be available with this amount of analysis. Instead of creating a sparse cube, it is sometimes better to create another separate, but linked, cube in which a sub-set of the data can be analyzed into great detail. The linking ensures that the data in the cubes remain consistent.
I'd welcome any advice or critiquing!

From the sources on the wiki page I learned another couple of terms that apply to me: "cross tabulation", or "pivot table". Example:

[attached image]
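A cross tabulation can be sketched in a few lines. This is an illustrative example with made-up attribute names, not PhpStats code: events are counted along two attributes at once, one attribute forming the rows and the other the columns.

```php
<?php
// Illustrative cross tabulation ("pivot table"): count events bucketed
// by one attribute for the rows and another for the columns.
function crossTab(array $events, string $rowAttr, string $colAttr): array
{
    $table = [];
    foreach ($events as $e) {
        $r = $e[$rowAttr];
        $c = $e[$colAttr];
        $table[$r][$c] = ($table[$r][$c] ?? 0) + 1;
    }
    return $table;
}

$events = [
    ['browser' => 'firefox', 'page' => 'home'],
    ['browser' => 'firefox', 'page' => 'search'],
    ['browser' => 'ie',      'page' => 'home'],
];
$pivot = crossTab($events, 'browser', 'page');
// $pivot['firefox'] === ['home' => 1, 'search' => 1]
```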

It's entirely coincidental, but interesting, that the example crosses over with automotive, which is my other line of work ;-)

It's also interesting that collecting this data, and being able to pull it up in milliseconds, opens up lots of doors in terms of statistical analysis (read up on the "chi-squared test" or "Bayesian probability", for example).
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Statistics Framework

Post by josh »

I was able to avoid hundreds of thousands of "0 result" queries (sparsity) with another refactoring: overloading the strategy for the Day object to reuse some of the work that was already done when compacting the Hour.

Sample data:
1 million events logged over 4 days

Test machine is a 4-year-old spare server with only 1GB of RAM...

Before:
Compacting an Hour interval took 2 minutes
Compacting a Day interval took 17 HOURS

After:
Compacting an Hour took 2 minutes
Compacting a Day took 1 minute

http://code.google.com/p/socks/source/detail?r=225#

Got the idea from one of the articles in that wiki.

So now the only place permutations or power sets have to be computed is when compacting the hour interval. Because that work is constrained to an hour's worth of data at a time, for one event type at a time, the number of possible combinations will hopefully tend to be small.
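The power-set step discussed above can be sketched as follows. This is an illustrative implementation, not the actual PhpStats one: every subset of the attributes seen during the hour is a combination whose hit count may need its own summary row, which is why constraining the input set matters so much.

```php
<?php
// Illustrative power set: all subsets of the given items, including
// the empty set. n items yield 2^n subsets, so keeping n small by
// pre-filtering attributes (as described above) is critical.
function powerSet(array $items): array
{
    $subsets = [[]];
    foreach ($items as $item) {
        // foreach iterates a copy, so appending inside is safe in PHP.
        foreach ($subsets as $subset) {
            $subset[] = $item;
            $subsets[] = $subset;
        }
    }
    return $subsets;
}

// 3 attributes -> 2^3 = 8 combinations.
$combos = powerSet(['browser', 'page', 'referrer']);
```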

Once that is done, the SQL engine can do the heavy lifting of "collapsing" the data, rolling it up into the daily summaries. I'm assuming the strategy for going from the day to the month will be the same.
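A rollup in that spirit might look like the query below. The table and column names here are invented for illustration; the point is that once hourly summaries exist, a single INSERT ... SELECT with GROUP BY collapses them into daily rows, with no power set needed at this level.

```php
<?php
// Hypothetical hour-to-day rollup (invented schema): let MySQL sum the
// pre-computed hourly rows into one daily row per combination.
$sql = <<<SQL
INSERT INTO stats_day (day, event_type, attributes, hits)
SELECT DATE(hour) AS day, event_type, attributes, SUM(hits)
FROM stats_hour
WHERE hour >= :start AND hour < :end
GROUP BY DATE(hour), event_type, attributes
SQL;
```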

I think it's pretty interesting that I avoided the whole sparsity problem with a flick of the wrist, but I'm hoping that writing this won't jinx me.
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Statistics Framework

Post by josh »

I've launched my first implementation of this. At times I thought I was attempting something impossible and wanted to give up... but I kept going, and kept thinking of ways to make it faster, like using a serialized string to store the attributes instead of the clumsy EAV approach.

On a quad Xeon with a decent amount of RAM, we have logged some 30-odd million "actions" in the last few months. The reporting application generates reports in under 10 seconds on average, no matter how much we log. These benchmarks hold whether I want all traffic or just traffic with certain "tags" or attributes (that was part of the design from the beginning).

[attached image]

For example, to generate this report the query only had to look at 4 rows, not 14.5 million.

The source tree has been refactored and has evolved since my last update:
http://code.google.com/p/socks/source/b ... y/PhpStats (files ending in Test.php are unit test files)

For example, if we log an action with attribute "a" set to "1" and "b" set to "2", the algorithm generates the following string for indexing purposes: :a:1;:b:2; That way I don't need to join tables that are 100 million rows large in order to constrain on some attribute(s).
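That indexing string could be built like this. A minimal sketch, not the actual PhpStats function: the key property is that attributes are serialized in a canonical (sorted) order, so equal attribute sets always produce the same string, which can then be matched and indexed directly.

```php
<?php
// Sketch: serialize attributes in sorted order so the same attribute
// set always yields the same indexable string.
function serializeAttributes(array $attrs): string
{
    ksort($attrs);                        // canonical order by name
    $s = '';
    foreach ($attrs as $name => $value) {
        $s .= ":{$name}:{$value};";
    }
    return $s;
}

echo serializeAttributes(['a' => 1, 'b' => 2]); // prints ":a:1;:b:2;"
```

An exact match then becomes a lookup on one indexed column (`WHERE attributes = ':a:1;:b:2;'`), and a partial constraint can use a substring match on the delimited form, with no join against the raw event table.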
Jonah Bron
DevNet Master
Posts: 2764
Joined: Thu Mar 15, 2007 6:28 pm
Location: Redding, California

Re: Statistics Framework

Post by Jonah Bron »

:crazy:
So is this the stuff you learn when you get a Computer Science degree?
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Statistics Framework

Post by Eran »

Actually, no. This is the stuff you learn when you deal with real-world concerns.
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Re: Statistics Framework

Post by John Cartwright »

Jonah Bron wrote::crazy:
So is this the stuff you learn when you get a Computer Science degree?
//derailment

You would be surprised how little some CS programs teach you ;)

//end derailment
Jonah Bron
DevNet Master
Posts: 2764
Joined: Thu Mar 15, 2007 6:28 pm
Location: Redding, California

Re: Statistics Framework

Post by Jonah Bron »

Interesting.

</offtopic>
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Statistics Framework

Post by josh »

Actually, I had no formal schooling in programming. This is what you get when you set your mind to your goals ;-)