
Finding crawlers without using get_browser()

Posted: Thu Jan 15, 2004 6:47 am
by Toneboy
I was looking to find a way to keep crawlers out of my users online figures (if they hit in one go they tend to distort things a bit - up to 50 on one occasion). So I did a bit of reading and compiled a few tests, and found that my host doesn't put the browscap.ini file on their servers. :(

As my users online figures are stored on a MySql database I was thinking of a query along the following lines:

Code: Select all

SELECT * FROM usersunique WHERE browser NOT LIKE '%googlebot%' ORDER BY `visitorid` DESC;
N.B. I've only run that line in phpMyAdmin. The browser field in the usersunique table basically holds the $HTTP_USER_AGENT details.

But even eliminating googlebot, grub & inktomi there appear to be other crawlers getting through.

Is there any way you can find out if a user is a crawler without using get_browser()?

Thanks.

Posted: Thu Jan 15, 2004 6:51 am
by JayBird
why not just use a robots.txt file on your server?

Code: Select all

# robots, scram

User-agent: *
Disallow: /user_stats/
and also put this in your head tag, just to be safe

Code: Select all

<meta name="robots" content="noindex,nofollow">
Mark

Posted: Thu Jan 15, 2004 7:39 am
by Toneboy
Hmm... did a search on that in the PHP manual, couldn't find anything. If you're talking about asking my host (I just have a virtual server) to put a .txt file on their server I might as well ask them to put the browscap.ini file there too. Sorry if I've got confused about what you meant, Mark.

Did some more searching. This is from a page with the year dated 2000, so I'll need to research it a bit more before implementing anything along these lines. I'll adapt it and test it later and let you all know how I got on.

Code: Select all

if (strstr($HTTP_USER_AGENT, "htdig") ||
    strstr($HTTP_USER_AGENT, "Wget") ||
    strstr($HTTP_USER_AGENT, "Bench") ||
    strstr($HTTP_USER_AGENT, "spider") ||
    strstr($HTTP_USER_AGENT, "crawler"))
{
    // carry out this branch if the visitor is a spider/crawler
    page_open(array());
}
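
If that turns out to be the right direction, the keyword test could be wrapped in a little function so the list is easier to extend. This is only my own untested sketch (the function name and the keyword list are made up, not from the page I found):

Code: Select all

```php
<?php
// is_crawler() is a hypothetical helper: it returns true if the
// user agent string contains any of the listed keywords.
function is_crawler($agent)
{
    $keywords = array("googlebot", "inktomi", "grub", "htdig",
                      "wget", "spider", "crawler");
    // lower-case once so the comparison is case-insensitive
    $agent = strtolower($agent);
    foreach ($keywords as $keyword) {
        if (strstr($agent, $keyword)) {
            return true;
        }
    }
    return false;
}
```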

Posted: Thu Jan 15, 2004 7:40 am
by Dr Evil
I think what Toneboy wants is just to keep bots off his stats, but still have the pages indexed.

[EDIT] Sorry my post came too late... [/EDIT]

Dr Evil

Posted: Thu Jan 15, 2004 7:49 am
by Toneboy
Additional: I certainly don't wish to stop search engine spiders/crawlers indexing my site. Besides anything else, from my logs it looks as if the crawlers come onto one page, then do nothing else in that session. If they're hitting in droves they tend to start a new session every 4-5 seconds.

Once anyone comes onto my site a session is started, and the start time, session id and $HTTP_USER_AGENT are stored in a MySql database. If the above script works and I can filter out anything shown to be a spider/crawler, I can hopefully have a users online display which shows the correct number of real users.

You see I'd love to think that 50 people at once could be on my site, but I'm realistic enough to know it isn't very likely. ;)
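
Given a table like that, I imagine the users online query would end up something along these lines (just a sketch; the `starttime` column name and the five-minute window are my own guesses):

Code: Select all

```sql
-- Count sessions started in the last 5 minutes, skipping anything
-- that looks like a crawler ('starttime' is a guessed column name)
SELECT COUNT(*) FROM usersunique
WHERE starttime > DATE_SUB(NOW(), INTERVAL 5 MINUTE)
  AND browser NOT LIKE '%googlebot%'
  AND browser NOT LIKE '%inktomi%'
  AND browser NOT LIKE '%crawl%'
  AND browser NOT LIKE '%spider%';
```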

Posted: Thu Jan 15, 2004 7:50 am
by Toneboy
Dr Evil wrote:I think what Toneboy wants is just to keep bots off his stats, but still have the pages indexed.
Indeedy.
Dr Evil wrote:[EDIT] Sorry my post came too late... [/EDIT]

Dr Evil
No problem, thanks for helping out.

Posted: Thu Jan 15, 2004 8:04 am
by JayBird
I wasn't sure whether you wanted to keep spiders from browsing a specific page, or to stop counting them as visitors.

Anyway, I'll include a description of the robots.txt thing for anyone else interested....

All a robots.txt file is, is a file you upload to your root directory which allows/disallows spiders from certain directories and/or files.

For more info, read here http://www.searchengineworld.com/robots ... torial.htm

Think of any major site, then open up Internet Explorer and type the URL followed by /robots.txt and you will see that the major sites use this method

e.g. http://www.cnn.com/robots.txt


Mark

Posted: Thu Jan 15, 2004 8:07 am
by Dr Evil
You could try comparing the visitor to a list: http://www.robotstxt.org/wc/active.html

or use another method to count the visitors online. It's not as elegant, but I store every visitor on any page by IP for 15 minutes in a database. Each query first erases the old entries and then adds the current one. This also allows me to not log my own visits.
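
In case it helps anyone, the queries for that method might look roughly like this (my own sketch of what I described above; `online`, `ip` and `lastseen` are invented names, and `ip` would need to be a unique key for REPLACE to work):

Code: Select all

```sql
-- throw away entries older than 15 minutes
DELETE FROM online WHERE lastseen < DATE_SUB(NOW(), INTERVAL 15 MINUTE);

-- add or refresh the current visitor ('1.2.3.4' stands in for
-- the visitor's address; one row per IP thanks to the unique key)
REPLACE INTO online SET ip = '1.2.3.4', lastseen = NOW();

-- anyone left in the table counts as "online"
SELECT COUNT(*) FROM online;
```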

Dr Evil

Posted: Thu Jan 15, 2004 6:42 pm
by Toneboy
Okay, forget what I posted earlier. It seems the only way to do this is to look for certain words in the $HTTP_USER_AGENT variable and eliminate them accordingly, a bit like this:

Code: Select all

SELECT * FROM usersunique WHERE browser NOT LIKE '%inktomi%' AND browser NOT LIKE '%googlebot%' AND browser NOT LIKE '%crawl%' ORDER BY `visitorid` DESC;
(Apologies if the code is lousy; I don't happen to think the MySql.com site is as easy to find your way around as the PHP equivalent.)

That seems to cover most things. In my log I've got one user which looks like it is a spider, but the $HTTP_USER_AGENT gives no clues away. Yet there are about twenty different sessions opened in quick succession by the same I.P. address, possibly something to work on.
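
If you do log the I.P. address alongside each session, a grouping query might flush out that sort of visitor (a sketch only; `ipaddress` and `starttime` are guessed column names, and the thresholds are arbitrary):

Code: Select all

```sql
-- list addresses that opened more than 5 sessions in the last 10 minutes
SELECT ipaddress, COUNT(*) AS sessions
FROM usersunique
WHERE starttime > DATE_SUB(NOW(), INTERVAL 10 MINUTE)
GROUP BY ipaddress
HAVING sessions > 5;
```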

Posted: Thu Jan 15, 2004 10:49 pm
by d3ad1ysp0rk
I'm guessing you have some insert code, i.e.:

Code: Select all

<?php
$browser = $HTTP_USER_AGENT;
$sql = "INSERT INTO `table` SET browser = '$browser'";
mysql_query($sql);
?>
but couldn't you just say

Code: Select all

<?php
$container = $_SERVER['HTTP_USER_AGENT'];
$string1 = "google";
$string2 = "msn";
$string3 = "whatever";
if (!strstr($container, $string1) && !strstr($container, $string2) && !strstr($container, $string3)) {
    // only log visitors whose user agent matches none of the keywords
    $browser = $container;
    $sql = "INSERT INTO `table` SET browser = '$browser'";
    mysql_query($sql);
}
?>
should work..