Page 1 of 2

Statistics Framework

Posted: Thu Feb 11, 2010 6:29 pm
by josh
http://code.google.com/p/socks/

"It becomes inefficient to log every hit in your application. A "rollup" is a table that aggregates statistics for performance reasons. "

critique 8)

Re: Statistics Framework

Posted: Thu Feb 11, 2010 6:32 pm
by Eran
Hard to critique without the source code

Re: Statistics Framework

Posted: Fri Feb 12, 2010 3:24 am
by josh
Click "source" and then "browse" or do an anonymous checkout.

And I'll copy this here from the homepage:
The stats framework itself can be found in library/PhpStats

See HTML version of documentation in library/PhpStats/documentation
Also example "reporting application" using Zend: http://code.google.com/p/socks/source/b ... Code/Stats
Performance tests http://code.google.com/p/socks/source/b ... erformance
Direct link to code itself including unit tests: http://code.google.com/p/socks/source/b ... y/PhpStats

Also in case it isn't obvious from the documentation you use that Logger thing to record hits and the Month, Day and Year classes to generate reports. When you generate reports it does the rollup behind the scenes.

I like how google code allows the community to attach their comments to the code itself. (http://code.google.com/p/support/wiki/CodeReviews)

Re: Statistics Framework

Posted: Fri Feb 12, 2010 7:27 am
by Eran
I tried doing a checkout and I got the following error -

Re: Statistics Framework

Posted: Fri Feb 12, 2010 9:33 am
by josh
Thanks, submitted bug to google. http://code.google.com/p/support/issues/detail?id=3599

Take off the /trunk and it will work.

Re: Statistics Framework

Posted: Wed Feb 17, 2010 7:49 pm
by Eran
I haven't had the time to go over the code, but I'm very interested in this kind of library. Can you explain some of the advantegous and show some real-world examples of how this can simplify things? (I saw you mentioned an algorithm for aggregate functions on many rows).
Also, you have some HTML entities showing on the google site

Re: Statistics Framework

Posted: Wed Feb 17, 2010 10:15 pm
by josh
The idea of the framework is if you ran a classified ads site, to be able to show your users how many hits their ads receive.

Google doesn't support full HTML. Their problem not mine. If someone wants to port it to wiki markup I'd give you 100 thank yous.

The example:

LOG TABLE
'firefox', '127.0.0.1', '800x600'
'IE', '127.0.0.2', '1024x769'
(imagine the log table went on for a billion more rows)

Much more efficient to store something like this:

Month,day,year,screen_rez,count
'Feb', '1', '2009', '800x600', '52'

Instead of storing 52 individual rows, we store 1 single row that describes that "segment" of our traffic. This is called a rollup.

Lots of tool vendors offer rollups (oracle) but seeing as those are not open source and my solution IS, I think there is a reason for it to exist ;-) Plus a corp. just licensed (hired me to implement it on their site) which more than payed for the time its taken to create ;-)

So heres how I'd sum it up in 1 sentence.

Instead of running a COUNT query on a billion rows, its more efficient to run a SUM query on a million rows that each contain the VALUE "1,000".


I am still researching an anomaly I have found. See this idea is a combo of EAV & rollups (oracle). There is an anomaly.... Its hard to describe but as you roll up the traffic there are more "power sets" (permutations of "attributes" http://en.wikipedia.org/wiki/Power_set meaning rolling up from day to month increases the number of rows). But it appears to be an anomaly that occurs in proportion to the complexity of attributes used (volume of traffic does not affect performance... thats the goal of the framework at the end of the day)... also attributes in my framwork by design are to be used scarcely (eg. to segment traffic so each user can get a report of just their traffic)

Re: Statistics Framework

Posted: Thu Feb 18, 2010 2:40 am
by Eran
MySQL has a somewhat different definition to a rollup (ie, it still happens in realtime) - http://dev.mysql.com/doc/refman/5.0/en/ ... fiers.html
To me it sounds like you're talking more about data-warehousing techniques, such as a star schema where some of the aggregated data is kept in a denormalized form in additional dimension tables. What I'm interested it, is how does your library help with such data aside from suggesting a more performant schema

Re: Statistics Framework

Posted: Thu Feb 18, 2010 11:42 am
by josh
pytrin wrote:To me it sounds like you're talking more about data-warehousing techniques, such as a star schema where some of the aggregated data is kept in a denormalized form in additional dimension tables. What I'm interested it, is how does your library help with such data aside from suggesting a more performant schema
It is based on the ideas of a star schema, right now it uses EAV so it can support varying numbers of "segments" with one schema. More like Oracle's cube rollup... I guess? http://www.psoug.org/reference/rollup.html

I guess technically its not a rollup as the Mysql/Oracle describe it. My solution differs in that once we've "rolled up" the stats for the Month of Jan, and the user asks to see traffic for all the year at the "grain" of 'month', it would not have a need to look into the log table, or even the hours table, or even the days table. Once it calculates the traffic for January it need not calculate it again (unless the month of January is not over).

If you don't have over 10M rows in your log table than my solution is probably not as performant as just having a log table and running SUM or ROLLUP commands on it. Once you have millions of rows its a no brainer to see that:

1+1+1+1+1+1+1+1=
is slower than doing
3+3+3=
which is even slower than just doing
9=

I know people who run large sites who just prune their data over time. This is the problem I am directly addressing. No one should ever have to prune their data.

Re: Statistics Framework

Posted: Thu Feb 18, 2010 11:54 am
by Eran
Yes, storing summaries is the most common way to scale log / statistical data. I'll dig in further to see how this is done automatically by your library.
I know people who run large sites who just prune their data over time. This is the problem I am directly addressing. No one should ever have to prune their data.
How does this avoid pruning? wouldn't you like to still remove the original data from the normalized tables to prevent them from becoming too large?

Re: Statistics Framework

Posted: Thu Feb 18, 2010 2:17 pm
by Christopher
josh wrote:I guess technically its not a rollup as the Mysql/Oracle describe it. My solution differs in that once we've "rolled up" the stats for the Month of Jan, and the user asks to see traffic for all the year at the "grain" of 'month', it would not have a need to look into the log table, or even the hours table, or even the days table. Once it calculates the traffic for January it need not calculate it again (unless the month of January is not over).
I have been following this discussion, but have not looked closely. In reading what you said above, it almost sounds like a query cache that generates calculated, historic data (i.e. data that will not change) and then uses it for future queries that require the same calculated data. I don't think that is what you are doing, but the idea of generating the data on-demand and storing it in an EAV is interesting. It is something that could perhaps be transparent to the Domain Layer -- sitting below it, but above the actual data access.
josh wrote:I know people who run large sites who just prune their data over time. This is the problem I am directly addressing. No one should ever have to prune their data.
I have in the past moved older data to an archive table (with the same schema as the "current" table) and then added a JOIN to the SQL if they requested a date range before the start date of the "current" table. That made most reports pretty fast, but still allowed reports on older data.

Re: Statistics Framework

Posted: Thu Feb 18, 2010 2:20 pm
by Eran
I have in the past moved older data to an archive table (with the same schema as the "current" table) and then added a JOIN to the SQL if they requested a date range before the start date of the "current" table. That made most reports pretty fast, but still allowed reports on older data.
This technique is called partitioning (or sharding), and is now supported natively in recent versions of MySQL. You can declare the column to split the table by, the ranges for each partition, and MySQL will select the partition for you so you don't need to manually join.

Re: Statistics Framework

Posted: Thu Feb 18, 2010 2:23 pm
by josh
pytrin wrote:How does this avoid pruning? wouldn't you like to still remove the original data from the normalized tables to prevent them from becoming too large?
Yes, it is pruning in the same way your brain prunes memories. You might remember a child-hood memory but you don't remember every detail about it.

The trade off to using my "rollups" is you won't be able to go back and pull up a raw log of hits (unless you keep such a log on your own). However it does take IP into account and stores both the SUM and the SUM of only unique by hostname (so you can show views/unique views, clicks/unique clicks, what not)

I will just say again though this is not intended in any way to replace Google Analytics of awstats... which are for the webmaster's consumption (in case you were wondering how I plan to track browser, screen rez, etc.. I don't, you could if you wanted but I've only tested with 3 attributes with a few thousand possible values each. Which is all I plan to support for version 1. Maybe version 2 would be a version that uses DDL instead of EAV)

Re: Statistics Framework

Posted: Thu Feb 18, 2010 2:32 pm
by Christopher
pytrin wrote:This technique is called partitioning (or sharding), and is now supported natively in recent versions of MySQL. You can declare the column to split the table by, the ranges for each partition, and MySQL will select the partition for you so you don't need to manually join.
I would call it "poor man's sharding" and not quite sharding or partitioning as I understand the term.

Re: Statistics Framework

Posted: Thu Feb 18, 2010 3:21 pm
by Eran
Partitioning is exactly what you described - splitting a table into multiple tables (partitions), usually by a range on some key, in order to scale a large amount of data. You might have used it in the most simple manner - two tables, but it is indeed partitioning. It is sometimes called sharding, though sharding is also used to describe a similar technique in which different tables are kept on different physical servers.