
Filtering Data by Different Time Lengths

Posted: Mon Jan 19, 2009 1:55 am
by sdsportsfanatic
I have a database table storing users on my site. I want to be able to filter the users in 5 different ways (amt of favs, amt of views, amt of plays, amt of reviews, highest rating).

I have tables storing all the favs on the site and all the reviews on the site, and I have fields for each user storing amt of views, amt of plays, and rating. I do not have trouble filtering by views, plays, and rating. I was unsure how to filter by favs and reviews, so I added those fields to the users table as well and they are updated accordingly.

Now I do not have trouble filtering by any of those parameters. But I would like to take the filters further and filter the same data while only counting numbers accumulated over the past day, the past week, and the past month, in addition to all time. E.g., filtering all time may show that a user has 102 views, but filtering by today would show only 3.

I am unsure if this is possible. My one idea involves making a table for every parameter and storing each statistic with a timestamp. If I had to do that, I would have multiple tables with over 1,000,000 entries, since it would mean storing every view, every play, etc. If this is the only way to accomplish my goal, I would still need assistance: how could I filter the users based on the number of entries they have in another table? I am a little new to MySQL, so I am unfamiliar with how this is done.

Is there any easier way to do this? As you can see, I am new to the forum. I don't really like to register on a new forum and ask for help before contributing anything; but I've really hit the wall on this one and I couldn't find any solutions by searching online. I appreciate your help.

Re: Filtering Data by Different Time Lengths

Posted: Mon Jan 19, 2009 2:19 am
by VladSun
[sql]WHERE   `date` BETWEEN $filter_date1 AND $filter_date2[/sql]
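For the "past week" case, for example, the bounds can be computed in MySQL itself (a sketch; `date` stands for whatever timestamp column the filtered table has):

[sql]WHERE `date` BETWEEN NOW() - INTERVAL 7 DAY AND NOW()[/sql]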

Re: Filtering Data by Different Time Lengths

Posted: Mon Jan 19, 2009 2:32 am
by sdsportsfanatic
VladSun wrote:[sql]WHERE   `date` BETWEEN $filter_date1 AND $filter_date2[/sql]
I'm not sure this helps me. The users table does not have the `date` field that I want to filter by. The `date` should be referring to views or plays, selecting the ones that occurred during that timespan, and then determining which user has the most so that they can be displayed first. It is kind of difficult to explain so here is the site so you can see what I'm working with:

http://www.sdsportsfanatic.com/temp/iam ... rtists.php

A user should be able to click one of the rectangle box links (rank, peak, favs, etc) and then click on one of the time restrictions (today, this week, this month, all time) and have the data further filtered.

Re: Filtering Data by Different Time Lengths

Posted: Mon Jan 19, 2009 2:38 am
by VladSun
sdsportsfanatic wrote:
VladSun wrote:[sql]WHERE   `date` BETWEEN $filter_date1 AND $filter_date2[/sql]
I'm not sure this helps me. The users table does not have the `date` field that I want to filter by. The `date` should be referring to views or plays, selecting the ones that occurred during that timespan, and then determining which user has the most so that they can be displayed first.
It's obvious that you shouldn't filter the users table, but the contents table - use JOIN.
Also, it will be much more helpful if you post your DB schema.

Re: Filtering Data by Different Time Lengths

Posted: Mon Jan 19, 2009 2:49 am
by sdsportsfanatic
VladSun wrote:
sdsportsfanatic wrote:
VladSun wrote:[sql]WHERE   `date` BETWEEN $filter_date1 AND $filter_date2[/sql]
I'm not sure this helps me. The users table does not have the `date` field that I want to filter by. The `date` should be referring to views or plays, selecting the ones that occurred during that timespan, and then determining which user has the most so that they can be displayed first.
It's obvious that you shouldn't filter the users table, but the contents table - use JOIN.
Also, it will be much more helpful if you post your DB schema.
Thank you.

Code: Select all

tbl_users: id, time_joined, rank, peak, favs, views, plays, reviews, rating
tbl_favs: id, type, uid, tid, time_added
tbl_reviews: id, uid, tid, tuid, time_written
`tid` in tbl_favs matches an `id` in tbl_users. `tuid` in reviews matches an `id` in tbl_users.

These aren't complete but I'm fairly certain they include what you need to see. As you can see I have fields in the users table for all of those categories, but I'm guessing they are not needed and I should have tables for them all instead. Currently, I only have tables for favs and reviews.

I have never used JOIN. Is there a simple example you could show me using my data? Say I want to display the amount of favs each user has received in the last 7 days organized from most to least.

I appreciate your help.

Re: Filtering Data by Different Time Lengths

Posted: Mon Jan 19, 2009 4:12 am
by VladSun
[sql]SELECT
    member.id,
    COUNT(post.id) AS post_count
FROM
    member
INNER JOIN
    post ON member.id = post.FK_member_id
WHERE
    post.create_date BETWEEN '2001-01-01' AND '2002-03-01'
GROUP BY
    member.id
HAVING
    post_count > 0[/sql]

By using a join you create a "virtual" table constructed from the member and post tables by using their relationship (post.FK_member_id <=> member.id). Every row of this "virtual" table has all of the fields from the member table and all of the fields from the post table. I put an FK prefix on every field that references another table (FK stands for "foreign key").
The WHERE clause above filters posts by creation date.
The GROUP BY clause tells the SQL engine that you want to count posts per user (count grouped by user).
The HAVING clause filters the already grouped rows (which consist only of the member.id and post_count fields) and removes those that don't have any posts during the given time period.
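Applied to the schema you posted (a sketch: it assumes `time_added` is a DATETIME and that `tid` is the user receiving the fav, so adjust if either assumption is wrong), the "favs received in the last 7 days, most to least" query might look like:

[sql]SELECT
    tbl_users.id,
    COUNT(tbl_favs.id) AS fav_count
FROM
    tbl_users
INNER JOIN
    tbl_favs ON tbl_users.id = tbl_favs.tid
WHERE
    tbl_favs.time_added >= NOW() - INTERVAL 7 DAY
GROUP BY
    tbl_users.id
ORDER BY
    fav_count DESC[/sql]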

Re: Filtering Data by Different Time Lengths

Posted: Mon Jan 19, 2009 7:21 pm
by sdsportsfanatic
VladSun wrote:[sql]SELECT
    member.id,
    COUNT(post.id) AS post_count
FROM
    member
INNER JOIN
    post ON member.id = post.FK_member_id
WHERE
    post.create_date BETWEEN '2001-01-01' AND '2002-03-01'
GROUP BY
    member.id
HAVING
    post_count > 0[/sql]

By using a join you create a "virtual" table constructed from the member and post tables by using their relationship (post.FK_member_id <=> member.id). Every row of this "virtual" table has all of the fields from the member table and all of the fields from the post table. I put an FK prefix on every field that references another table (FK stands for "foreign key").
The WHERE clause above filters posts by creation date.
The GROUP BY clause tells the SQL engine that you want to count posts per user (count grouped by user).
The HAVING clause filters the already grouped rows (which consist only of the member.id and post_count fields) and removes those that don't have any posts during the given time period.
Thank you for the explanation. I think I understand and I'm going to test it out tonight.

One question I still have is about creating tables for views and plays. Can it be bad to store all of this data? Every time a user's page is visited, I will need to add an entry to the views table, and likewise every time a song is played. I'm guessing this will result in over 1,000,000 entries after 6-10 months, and the tables will only grow. Is this inefficient, or could it be problematic for my database/hosting?

Thanks again.

Re: Filtering Data by Different Time Lengths

Posted: Mon Jan 19, 2009 11:44 pm
by jack_indigo
sdsportsfanatic wrote:One question I still have is in regards to creating tables for views and plays. Can it be bad to store all of this data? Every time a user's page is visited, I will need to add an entry to the views table. Same with every time a song is played. I'm guessing this will result in over 1,000,000 entries after 6-10 months that will only grow. Is this inefficient or could it be problematic for my database/hosting?
1. Reads are usually less processor-intensive than writes, but only if you're joining just 3-4 tables and not constantly writing to those tables at the same time. If you are, then read on.

2. If you want to speed up those reads, there is the concept of normalization vs. denormalization. It's a spectrum. You can read up on this in the Wikipedia articles on normalization and denormalization. Anyway, if you denormalize the data a little, you need fewer table joins, and so the queries usually run faster.

3. One thing people do to speed up summary screens, reports, or data that they constantly read is to create duplicate tables in some other database called a data warehouse. This works in one of two designs. One design uses 3 databases -- OLTP, production, and OLAP (data warehouse). The other uses 2 databases -- production and OLAP (data warehouse). The OLTP database is normalized to the umpteenth degree and is designed for very fast writes, such as taking calls in a call center. However, it's not practical to run everything completely normalized, so you create a production database that the OLTP data is migrated to via a cron job or replication schedule. The production database is mostly normalized, but a few tables are denormalized a little in order to provide faster lookups here or there as necessary. The production database is your guaranteed source of the latest data, but it should be avoided for summary screens or reports -- anything that does heavy reads against it day in and day out. For heavy reads, you introduce your OLAP database (aka "data warehouse"), which is heavily denormalized so that you can run faster reports and summary screens. The drawback of a data warehouse is that its data is usually slightly old, because keeping it perfectly current would constantly tax the production database with extraction jobs. So what people do is create a cron job that runs maybe 1-4 times daily, pulling new data out of the production database and into the data warehouse.

4. The drawbacks of denormalization are that you have larger databases (because they contain more redundant data), and you have potential data integrity issues that you need to watch for at the application layer.

5. Think of a data grooming strategy. For instance, if there is any way to pull old rows out of a table into, say, an archive table, or to delete rows entirely after a period of time or once a record reaches a certain status -- do so. This reduces the size of the table. So, if I'm storing song favorites for my users but a user doesn't come back to my site for, say, a year, I can just delete his account and all records tied to it. This keeps things fast for all your regular users because you don't have stale data in your database.

6. You don't want to prematurely tune your database performance when there might not be an issue. You do need to look at your data grooming strategy, but I would say move to the first or second normal form (again -- Wikipedia, read up on normalization and denormalization) and leave it at that for now. Then look for bottlenecks as your site grows, and use the other points of this post to fix them. Sometimes an index helps. Sometimes how you compose that index makes the difference. Sometimes an index hurts or does little good. And sometimes your data grooming and normalization/denormalization strategy is key to solving your problem.
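As a concrete sketch of the index point in item 6, using the tbl_favs table from earlier in the thread (the index name is made up): a compound index on the join key plus the timestamp lets MySQL count date-filtered favs per user without scanning the whole table.

[sql]-- Compound index for date-filtered per-user counts.
-- Column order matters: tid first (the join/grouping key),
-- time_added second (the range filter within each user).
CREATE INDEX idx_favs_tid_time ON tbl_favs (tid, time_added);[/sql]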

Re: Filtering Data by Different Time Lengths

Posted: Tue Jan 20, 2009 6:03 pm
by sdsportsfanatic
I appreciate your reply and will take it into consideration as I work. As a less-experienced developer, I didn't quite follow everything; but I was able to get the idea of what you were saying.

Re: Filtering Data by Different Time Lengths

Posted: Tue Jan 20, 2009 8:16 pm
by jack_indigo
sdsportsfanatic wrote:I appreciate your reply and will take it into consideration as I work. As a less-experienced developer, I didn't quite follow everything; but I was able to get the idea of what you were saying.
Okay, terms like OLTP; OLAP; data mart; data warehouse; denormalization; normalization; first, second, x normal forms -- these are confusing. However, they are important to your application's growth if you want to scale to handle demand. I'd say go to Wikipedia and read up on normalization and denormalization, and then try to get a book on your database platform (MySQL, PostgreSQL, etc.) and see what they say about normalization so you can see it in the context of your table creation statements. (Often books don't talk about denormalization, but it is just as important as normalization and basically it's the reverse of normalization.) Then, grow from there. A primer:

- normalization: the process of pulling redundant data out of one table and into other tables. Increases write performance but decreases read performance. Increases data integrity because you don't have to worry about keeping redundant copies in sync across multiple tables (although this depends on how far you take it through what are called "normal forms").

- denormalization: the reverse of normalization -- creating redundant data in tables so that joins are not required. Increases read performance but decreases write performance. Decreases data integrity because you have to worry about copying the same redundant data among multiple tables as you seek to eliminate the need for a join.

- OLTP: Online Transaction Processing -- in terms of database design, this means you've geared a database to be heavily normalized to handle the constant write requests of, say, a call center. Thus, normalization is key. But often a completely normalized database is impractical to read from, so you also need a production database that is slightly denormalized. A scheduled job walks data over from the OLTP database to the production database.

- OLAP: Online Analytical Processing -- in terms of database design, this means you've geared a database to be heavily denormalized to handle the constant read requests of, say, a team that generates customer reports. Normally this is a separate database from the OLTP and production databases -- otherwise it would slow them both down. Besides, keeping it separate provides more fault tolerance. An OLAP database is also known as a "data warehouse". The data warehouse's data is usually stale by anywhere from a couple of hours to an entire day. It's designed for reports. Usually one creates a cron job or replication schedule to walk data over from the production database to the data warehouse (aka OLAP) database.

- Data Warehouse. See OLAP.

- Data Mart. This is usually an isolated portion of your data warehouse dedicated to a particular function, such as serving up data for one major customer or for a group of customers by customer type. Because it's designed to be an island of data, splitting things out this way provides some fault tolerance: you can work on other parts of the database, or suffer outages there, without affecting particular data marts.
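To tie this back to the tables in this thread, here's what a small denormalization might look like (a sketch only; the `target_rating` column is hypothetical, purely for illustration):

[sql]-- Normalized: a user's rating lives only in tbl_users, so reading
-- it alongside each fav requires a join.
SELECT u.rating, f.time_added
FROM tbl_favs f
INNER JOIN tbl_users u ON u.id = f.tid;

-- Denormalized: copy the rating into tbl_favs so reads skip the
-- join, at the cost of updating every copy whenever it changes.
ALTER TABLE tbl_favs ADD COLUMN target_rating INT;[/sql]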

Re: Filtering Data by Different Time Lengths

Posted: Wed Jan 21, 2009 6:52 pm
by sdsportsfanatic
Makes sense, and I learned a few things. :D Thanks for the help.