stats-spiders

PHP programming forum. Ask questions or help people concerning PHP code. Don't understand a function? Need help implementing a class? Don't understand a class? Here is where to ask. Remember to do your homework!

Moderator: General Moderators

dal_oscar
Forum Newbie
Posts: 21
Joined: Fri Apr 07, 2006 6:00 am
Location: UK

Post by dal_oscar »

I am storing the IP addresses and user agents of all visitors to my website in a MySQL table, and I want to filter out the spiders and crawlers when I display the totals for my web stats.

I have a list of spider user agents, and when a new hit comes in I currently just dump it into a different table. When displaying the stats, I retrieve each hit and compare its user agent against the spider user-agent list.

Say I have a table Spiders. In this table I have columns A, B, C, D through Z.

Every spider is written to the column matching the first character of its user agent.

So basically, my query doesn't search all the fields when checking for a spider, but only the column matching the first character.

I think this decreases the load and therefore the waiting time, but it still takes ages to load.

Any suggestions?
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

It doesn't really decrease the load (it's likely the opposite), but it does require your database to store a lot of data that isn't being used.

I'd generally use get_browser() or a pure PHP version of it.
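A minimal sketch of that call (get_browser() reads a browscap.ini file, so the browscap directive must be set in php.ini; without it the function returns false, and the field names below are standard browscap properties):

```php
<?php
// get_browser() matches a user agent against browscap.ini and returns
// its known properties. Passing true as the second argument returns an
// array instead of an object.
$ua   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$info = @get_browser($ua, true);   // false if browscap isn't configured

// Typical keys include 'browser', 'version', 'platform', 'crawler', ...
if ($info !== false) {
    var_dump($info['browser'], $info['crawler']);
}
```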

Post by dal_oscar »

Sorry feyd - I don't follow. What do you mean? I already have a list of the user agents for each hit. The problem is when I extract that data and compare each UA with the spiders' UAs.

Post by feyd »

Which part don't you follow?

Post by dal_oscar »

Why do I need to use get_browser()? I already have all the user agents.

Post by feyd »

It has nothing to do with having the user agents; it's about knowing which agents are what. get_browser() builds detailed information about the user agent given to it.

Post by dal_oscar »

OK... so you mean I should store all this info about users while they visit the website: check via get_browser(), and if they look like a spider, ignore them; otherwise put them in the DB.

Post by feyd »

I'd still store them; you may want information on them later. Just filter the current display on whether the agent is considered a spider or not. How you do that is up to you.
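Concretely, if the spider decision is stored as a flag at insert time (the hits table and is_spider column here are hypothetical, not from the thread), the display-time filter reduces to an indexed query, e.g.:

```php
<?php
// Hypothetical schema: hits(ip, user_agent, is_spider TINYINT(1), hit_time DATETIME).
// With the spider decision made once per hit, the stats page no longer
// re-matches hundreds of user agents per row -- it just filters on the flag.
function statsQuery($weeks)
{
    return "SELECT COUNT(*) AS total FROM hits"
         . " WHERE is_spider = 0"
         . " AND hit_time >= DATE_SUB(NOW(), INTERVAL " . (int) $weeks . " WEEK)";
}
// echo statsQuery(4);  // run via mysql_query() (current in 2006) or PDO today
```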

Post by dal_oscar »

Yeah... but filtering the display seems to be taking ages - the page never loads. I need a lot of info: four weeks, at about 8,000 hits a day.
My list of spiders has 557 entries, which makes it very tedious to search through.
Any suggestions on how to speed up the filtering?

Post by feyd »

Store a flag for whether the agent is considered a spider or not. Better yet, break out the information returned by get_browser() into separate fields (possibly split across multiple tables, depending on how comfortable you are with normalization).
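A sketch of setting that flag once per hit (classifyHit() is a hypothetical helper, and the keyword fallback is a crude stand-in for environments where browscap isn't configured):

```php
<?php
// Decide "spider or not" once, when the hit is recorded; afterwards the
// flag is plain data and display queries never touch the UA list again.
function classifyHit($ua)
{
    $info = @get_browser($ua, true);   // false if browscap isn't set up
    if ($info !== false && isset($info['crawler'])) {
        // browscap values may come back as strings, so guard against "false".
        return ($info['crawler'] && $info['crawler'] !== 'false') ? 1 : 0;
    }
    // Crude fallback keyword check -- illustrative only.
    return preg_match('/bot|crawl|spider|slurp/i', $ua) ? 1 : 0;
}
// $flag = classifyHit($_SERVER['HTTP_USER_AGENT']);
// ...then: INSERT INTO hits (ip, user_agent, is_spider) VALUES (..., ..., $flag)
```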

Post by dal_oscar »

If I want to store a flag, there are two issues:
1) I will need to do this when a new hit is registered, which means more waiting time for my user.
And more importantly,
2) the "decide if it is a spider" step takes a lot of time anyway, whether I do it on my user's time or when the stats display is being generated. I have tried three approaches for deciding what is a spider:
a) just do a text search (that's 557 strings being searched every time a hit is picked up) in plain PHP code;
b) put the whole spider list in a database table and query it (which obviously takes longer than the first approach);
c) divide the spiders table into columns A, B, C ... Z, and when a new hit comes in, check its first character and query only the matching column, so the load is divided.

But it's still taking hours! I REALLY appreciate your help on this. I have been trying to fix this for the past two months!
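For what it's worth, the per-letter bucketing in (c) can be replaced outright: load the list once into a PHP array and flip it, so each check becomes a constant-time hash lookup instead of a scan (exact-match only; the three entries below are stand-ins for the real 557):

```php
<?php
// array_flip() turns the list into value => index, so isset() does a
// hash lookup instead of scanning 557 strings for every hit.
$spiders = array('Googlebot/2.1', 'msnbot/1.0', 'Yahoo! Slurp'); // stand-ins
$lookup  = array_flip($spiders);

function isSpider($ua, $lookup)
{
    return isset($lookup[$ua]);
}
// Substring patterns would instead need stripos() per entry, or one combined regex.
```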

Post by feyd »

As I said, use get_browser(). You don't have to search through your own list of spiders; it has a huge list built in and can tell you whether an agent is a spider very quickly.

Post by dal_oscar »

OK, so when someone visits my website, I run them through get_browser() and then store them in two different tables - normal people vs. spiders?

Could you give me an example of how to use get_browser() for such a thing?

Post by feyd »

dal_oscar wrote: OK, so when someone visits my website, I run them through get_browser() and then store them in two different tables - normal people vs. spiders?
Personally, I'd take a normalized approach: one main table plus several related tables for things like "platform," "browser," and so on.
via AOL translation, dal_oscar wrote: could you give me an example of how to use get_browser() for such a thing?
Example outputs from the function are given in both pages I have linked to in this thread already. At a minimum, store most, if not all, of the elements returned by get_browser() (or my function, posted earlier) as separate fields in the table. For statistics you'll likely want this data broken apart anyway, since it makes grouping things together much easier.

Post by dal_oscar »

Sorry, I guess I didn't frame my question correctly.
What I am saying is: once I have all the info from get_browser(), how do I decide whether or not the visitor is a spider?
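Though the thread ends here, get_browser()'s own output answers this: browscap data includes a crawler property per user agent, so the decision is a single field check (a sketch; it assumes browscap is configured in php.ini):

```php
<?php
// The browscap data behind get_browser() carries a 'crawler' property,
// so no separate spider list is needed for the decision itself.
$info = @get_browser('Googlebot/2.1 (+http://www.google.com/bot.html)', true);

if ($info !== false && !empty($info['crawler']) && $info['crawler'] !== 'false') {
    echo "spider\n";
} else {
    echo "regular browser (or browscap not configured)\n";
}
```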