
stats-spiders

Posted: Fri Apr 07, 2006 6:02 am
by dal_oscar
I am storing the IP address and user agent of every visitor to my website in a MySQL table, and I want to filter out the spiders and crawlers when I display the totals for my web stats.

I have a list of spider user agents, and when a new hit comes in, I currently just dump it into a separate table. When displaying the stats, I retrieve each hit and compare its user agent against the spider user agent list.

Say I have a table Spiders, with columns A through Z.

Every spider gets written into the column matching the first character of its user agent.

So basically, my query doesn't search all the columns when checking for a spider, only the one matching the first character.

I thought this would cut the load and thus the waiting time, but it still takes ages to load.

Any suggestions?

Posted: Fri Apr 07, 2006 7:36 am
by feyd
It doesn't really decrease the load (it's likely the opposite), but it does require your database to store a lot of data that isn't being used.

I'd generally use get_browser() or a pure php version of it.

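A minimal sketch of the get_browser() route feyd is pointing at. The helper names here are made up, and it assumes php.ini's "browscap" setting points at a browscap.ini file (get_browser() just returns false when it doesn't):

```php
<?php
// Decide from a browscap result (get_browser($ua, true) returns an
// associative array) whether the agent is a known crawler. Split out
// so the decision can be tested without a browscap file on hand.
function crawler_flag($info): bool
{
    // browscap entries expose a boolean "crawler" field for spiders
    return is_array($info) && !empty($info['crawler']);
}

// Full check for one request; needs browscap configured in php.ini.
function is_spider_ua(string $ua): bool
{
    return crawler_flag(get_browser($ua, true));
}
```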

Posted: Fri Apr 07, 2006 7:51 am
by dal_oscar
Sorry feyd, I don't follow. What do you mean? I already have a list of the user agents for each hit. The problem is when I extract that data and compare each UA with the spiders' UAs.

Posted: Fri Apr 07, 2006 7:57 am
by feyd
Which part don't you follow?

Posted: Fri Apr 07, 2006 7:59 am
by dal_oscar
Why do I need to use get_browser()? I already have all the useragents.

Posted: Fri Apr 07, 2006 8:03 am
by feyd
It has nothing to do with having the user agents; it's about accurately knowing which agents are what. get_browser() builds detailed information about the user agent given to it.

Posted: Fri Apr 07, 2006 8:07 am
by dal_oscar
OK... so you mean I should gather all this info about users while they visit the website: check via get_browser(), and if they look like a spider, ignore them; otherwise put them in the DB.

Posted: Fri Apr 07, 2006 8:13 am
by feyd
I'd still store them; you may want information on them later. Just filter the current display on whether the agent is considered a spider or not. How you do that is up to you.

Posted: Fri Apr 07, 2006 8:16 am
by dal_oscar
Yeah... but filtering the display seems to be taking ages; the page never loads. I need a lot of info: 4 weeks, with about 8000 hits a day.
My list of spiders has 557 entries, which makes it very tedious to search through.
Any suggestions on how to speed up the filtering?

Posted: Fri Apr 07, 2006 8:19 am
by feyd
Store a flag for whether the agent is considered a spider or not, or better yet, break out the information returned by get_browser() into separate fields (possibly split across multiple tables, depending on how comfortable you are with normalization).
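To make the flag idea concrete, here's a rough sketch. The table and column names (hits, is_spider) are invented for illustration; the point is that the expensive "is this a spider?" decision happens once, when the hit is inserted, and the stats page then filters on a cheap indexed column:

```php
<?php
// Compute a 0/1 spider flag ONCE, at insert time, from the array that
// get_browser($ua, true) returns (browscap sets a boolean "crawler" field).
function spider_flag($info): int
{
    return (is_array($info) && !empty($info['crawler'])) ? 1 : 0;
}

// Insert-time query: the flag is stored alongside the hit.
$insertSql = "INSERT INTO hits (ip, useragent, is_spider)
              VALUES (?, ?, ?)";

// Display-time query: filtering is now an indexed integer comparison,
// not a 557-string scan per row.
$statsSql = "SELECT DATE(hit_at) AS day, COUNT(*) AS total
               FROM hits
              WHERE is_spider = 0
              GROUP BY day";
```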

Posted: Fri Apr 07, 2006 8:27 am
by dal_oscar
If I want to store a flag, there are two issues:
1) I will need to do this when a new hit is registered, which means more waiting time for my user,
and more importantly,
2) the "decide if it is a spider" step takes a lot of time anyway, whether I do it on my user's time or when the stats display is being generated. I have tried 3 things for deciding what is a spider:
a) just do a text search (that's 557 strings being searched every time a hit is picked up) in normal PHP code
b) put the whole spider list in a database and query the database (obviously takes longer than the first one)
c) divide the spiders table into A, B, C ... Z, and when a new hit comes in, check if it begins with 'a' and then query that table. So this divides the load.

But it's still taking hours!!!!! I REALLY appreciate your help on this. I have been trying to fix this for the past 2 months!
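For what it's worth, variant (c) can also be done entirely in PHP memory, with no per-hit database query: bucket the 557 strings by first character once, then each lookup only scans one small bucket. A sketch, assuming the list entries match the start of the UA string as the A-Z scheme implies ($spiderList stands in for the real list):

```php
<?php
// Group the spider strings by first character, once, at startup.
function bucket_spiders(array $spiderList): array
{
    $buckets = [];
    foreach ($spiderList as $s) {
        $buckets[strtolower($s[0])][] = strtolower($s);
    }
    return $buckets;
}

// Check one user agent against only the bucket for its first character.
function ua_is_listed(string $ua, array $buckets): bool
{
    $ua = strtolower($ua);
    if ($ua === '') {
        return false;
    }
    foreach ($buckets[$ua[0]] ?? [] as $needle) {
        if (strpos($ua, $needle) === 0) { // UA starts with a listed string
            return true;
        }
    }
    return false;
}
```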

Posted: Fri Apr 07, 2006 8:31 am
by feyd
As I said, use get_browser(). You don't have to search through your list of spiders; it has a huge list built in and can find out whether an agent is a spider very quickly.

Posted: Fri Apr 07, 2006 8:35 am
by dal_oscar
OK, so when someone visits my website, I go via get_browser() and then store them in 2 different tables: normal people vs. spiders?

Could you give me an example of how to use get_browser() for such a thing?

Posted: Fri Apr 07, 2006 8:52 am
by feyd
dal_oscar wrote: OK, so when someone visits my website, I go via get_browser() and then store them in 2 different tables: normal people vs. spiders?
Personally, I'd take a normalized approach: one main table, with several related tables for things like "platform," "browser," et cetera.
via aol translation, dal_oscar wrote: could you give me an example of how to use get_browser for such a thing?
Example outputs from the function are given in both pages I have linked to in this thread already. At minimum, store most, if not all, of the elements returned by get_browser() (or my function, posted earlier) as separate fields in the table. For statistics, you'll likely want this data broken apart already anyway, as it makes grouping things together much easier.
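One possible shape for the normalized layout feyd describes, with invented table and column names; repeated attributes live in small lookup tables, and each hit row stores foreign keys plus a per-hit spider flag:

```sql
-- Illustrative schema only; names are not from the thread.
CREATE TABLE browsers  (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(64) UNIQUE);
CREATE TABLE platforms (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(64) UNIQUE);

CREATE TABLE hits (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    ip          VARCHAR(45),
    browser_id  INT,
    platform_id INT,
    is_spider   TINYINT(1) NOT NULL DEFAULT 0,
    hit_at      DATETIME,
    INDEX (is_spider),
    FOREIGN KEY (browser_id)  REFERENCES browsers(id),
    FOREIGN KEY (platform_id) REFERENCES platforms(id)
);
```

Grouping stats by browser or platform then becomes a join plus GROUP BY, rather than string parsing at display time.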

Posted: Fri Apr 07, 2006 8:55 am
by dal_oscar
Sorry, I guess I didn't frame my question correctly.
What I am saying is: once I have all the info from get_browser(), how do I decide whether or not it is a spider?