Finding crawlers without using get_browser()

Toneboy
Forum Contributor
Posts: 102
Joined: Wed Jul 31, 2002 5:59 am
Location: Law, Scotland.
Contact:

Post by Toneboy »

I was looking for a way to keep crawlers out of my users-online figures (if they hit in one go they tend to distort things a bit: up to 50 on one occasion). So I did a bit of reading and compiled a few tests, and found that my host doesn't put the browscap.ini file on their servers. :(

As my users-online figures are stored in a MySQL database I was thinking of a query along the following lines:

Code: Select all

SELECT * FROM usersunique WHERE browser NOT LIKE '%googlebot%' ORDER BY visitorid DESC;
N.B. I've only run that line in phpMyAdmin. The browser field in the usersunique table basically holds the $HTTP_USER_AGENT details.

But even eliminating googlebot, grub & inktomi there appear to be other crawlers getting through.

Is there any way you can find out if a user is a crawler without using get_browser()?

Thanks.
JayBird
Admin
Posts: 4524
Joined: Wed Aug 13, 2003 7:02 am
Location: York, UK
Contact:

Post by JayBird »

Why not just use a robots.txt file on your server?

Code: Select all

# robots, scram

User-agent: *
Disallow: /user_stats/
and also put this in your head tag, just to be safe

Code: Select all

<meta name="robots" content="noindex,nofollow">
Mark
Toneboy
Forum Contributor
Posts: 102
Joined: Wed Jul 31, 2002 5:59 am
Location: Law, Scotland.
Contact:

Post by Toneboy »

Hmm... did a search on that in the PHP manual, couldn't find anything. If you're talking about asking my host (I just have a virtual server) to put a .txt file on their server I might as well ask them to put the browscap.ini file there too. Sorry if I've got confused about what you meant, Mark.

Did some more searching. This is from a page with the year dated 2000, so I'll need to research it a bit more before implementing anything along these lines. I'll adapt it and test it later and let you all know how I got on.

Code: Select all

if (strstr($HTTP_USER_AGENT,"htdig") || 
strstr($HTTP_USER_AGENT,"Wget") || strstr($HTTP_USER_AGENT,"Bench") || 
strstr($HTTP_USER_AGENT,"spider") || strstr($HTTP_USER_AGENT,"crawler")) 
{ 
// carry out this branch if the visitor is a spider/crawler
page_open(array()); 
}
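
A minimal sketch of the same keyword test as a reusable function, in case it helps. It assumes PHP 7 or later (for the type hints), takes the agent string as an argument rather than reading the `$HTTP_USER_AGENT` global, and matches case-insensitively with `stripos()`. The name `is_crawler` and the keyword list are illustrative assumptions, not a complete list of crawlers.

```php
<?php
// Hypothetical helper: returns true when the user agent contains any
// keyword from $keywords (case-insensitive substring match).
// The default keyword list is an assumption and far from exhaustive.
function is_crawler(
    string $userAgent,
    array $keywords = ['googlebot', 'inktomi', 'grub', 'htdig', 'wget', 'spider', 'crawl']
): bool {
    foreach ($keywords as $keyword) {
        if (stripos($userAgent, $keyword) !== false) {
            return true;
        }
    }
    return false;
}
```

It could be called as `is_crawler($_SERVER['HTTP_USER_AGENT'] ?? '')`. Crawlers that send a browser-like agent string will still slip through a test like this.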
Dr Evil
Forum Contributor
Posts: 184
Joined: Wed Jan 14, 2004 9:56 am
Location: Switzerland

Post by Dr Evil »

I think what Toneboy wants is just to keep bots off his stats, but still have the pages indexed.

[EDIT] Sorry my post came too late... [/EDIT]

Dr Evil
Toneboy
Forum Contributor
Posts: 102
Joined: Wed Jul 31, 2002 5:59 am
Location: Law, Scotland.
Contact:

Post by Toneboy »

Additional: I certainly don't wish to stop search engine spiders/crawlers indexing my site. Besides anything else, from my logs it looks as if the crawlers come onto one page, then do nothing else in that session. If they're hitting in droves they tend to start a new session every 4-5 seconds.

Once anyone comes onto my site a session is started, and the start time, session id and $HTTP_USER_AGENT are stored in a MySQL database. If the above script works and I can filter out anything shown to be a spider/crawler, I can hopefully have a users-online display which shows the number of real users accurately.

You see I'd love to think that 50 people at once could be on my site, but I'm realistic enough to know it isn't very likely. ;)
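
A rough sketch of the filtering step described above, working on rows already fetched from the sessions table. The row keys `started_at` (a Unix timestamp) and `browser`, the 15-minute window, the keyword list and the function name `count_users_online` are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: counts sessions that started within the last
// $windowSeconds and whose user agent matches none of $botKeywords.
function count_users_online(
    array $sessions,
    int $now,
    int $windowSeconds = 900,
    array $botKeywords = ['googlebot', 'inktomi', 'grub', 'crawl', 'spider']
): int {
    $count = 0;
    foreach ($sessions as $session) {
        if ($now - $session['started_at'] > $windowSeconds) {
            continue; // too old to count as "online"
        }
        foreach ($botKeywords as $keyword) {
            if (stripos($session['browser'], $keyword) !== false) {
                continue 2; // looks like a crawler: skip to the next session
            }
        }
        $count++;
    }
    return $count;
}
```

Filtering in PHP rather than in the query keeps the keyword list in one place, at the cost of fetching rows that end up being discarded.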
Toneboy
Forum Contributor
Posts: 102
Joined: Wed Jul 31, 2002 5:59 am
Location: Law, Scotland.
Contact:

Post by Toneboy »

Dr Evil wrote:I think what Toneboy wants is just to keep bots off his stats, but still have the pages indexed.
Indeedy.
Dr Evil wrote:[EDIT] Sorry my post came too late... [/EDIT]

Dr Evil
No problem, thanks for helping out.
JayBird
Admin
Posts: 4524
Joined: Wed Aug 13, 2003 7:02 am
Location: York, UK
Contact:

Post by JayBird »

I wasn't sure whether you wanted to keep the spiders from browsing a specific page, or to stop counting them as visitors.

Anyway, I'll include a description of the robots.txt thing for anyone else interested....

A robots.txt file is just a file you upload to your root directory which allows/disallows spiders from certain directories and/or files.

For more info, read here http://www.searchengineworld.com/robots ... torial.htm

Think of any major site, then open up Internet Explorer and type the URL followed by /robots.txt and you will see that the major sites use this method

e.g. http://www.cnn.com/robots.txt


Mark
Last edited by JayBird on Thu Jan 15, 2004 8:08 am, edited 1 time in total.
Dr Evil
Forum Contributor
Posts: 184
Joined: Wed Jan 14, 2004 9:56 am
Location: Switzerland

Post by Dr Evil »

You could try comparing the visitor to a list: http://www.robotstxt.org/wc/active.html

or use another method to count the visitors online. It's not as elegant, but I store every visitor on any page by IP for 15 minutes in a database. Each query erases the old entries and adds the current one. This also allows me to not log my own visits.

Dr Evil
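
The IP-based approach above could look roughly like the sketch below. It keeps the visitor list in a plain array (IP address mapped to last-seen Unix timestamp) instead of a database table so that it stays self-contained. The function name `record_visit`, the `$ignoredIps` parameter and the 900-second default are assumptions for illustration.

```php
<?php
// Hypothetical helper: drops visitors not seen for more than $ttl seconds,
// then adds or refreshes the current IP unless it is in $ignoredIps.
// Returns the updated list; count() of the result is the users-online figure.
function record_visit(
    array $visitors,
    string $ip,
    int $now,
    int $ttl = 900,
    array $ignoredIps = []
): array {
    // Erase entries older than the time-to-live.
    foreach ($visitors as $seenIp => $lastSeen) {
        if ($now - $lastSeen > $ttl) {
            unset($visitors[$seenIp]);
        }
    }
    // Add (or refresh) the current visitor, skipping ignored addresses
    // such as the site owner's own IP.
    if (!in_array($ip, $ignoredIps, true)) {
        $visitors[$ip] = $now;
    }
    return $visitors;
}
```

Because a crawler that opens many sessions from one address still counts as a single IP, this method is less sensitive to the session bursts described earlier in the thread.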
Toneboy
Forum Contributor
Posts: 102
Joined: Wed Jul 31, 2002 5:59 am
Location: Law, Scotland.
Contact:

Post by Toneboy »

Okay, forget what I posted earlier. It seems the only way to do this is to look for certain words in the $HTTP_USER_AGENT variable and eliminate them accordingly, a bit like this:

Code: Select all

SELECT * FROM usersunique WHERE browser NOT LIKE '%inktomi%' AND browser NOT LIKE '%googlebot%' AND browser NOT LIKE '%crawl%' ORDER BY visitorid DESC;
(Apologies if the code is lousy, I don't happen to think the MySQL.com site is as easy to find its way around as the PHP equivalent.)

That seems to cover most things. In my log I've got one user which looks like it is a spider, but the $HTTP_USER_AGENT gives no clues away. Yet there are about twenty different sessions opened in quick succession by the same I.P. address, possibly something to work on.
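
One possible way to work on that last point is to flag any IP that opens many sessions within a short window. The sketch below assumes each session row carries an `ip` and a `started_at` Unix timestamp; those keys, the function name `find_burst_ips` and the default thresholds are illustrative guesses, not values taken from the log.

```php
<?php
// Hypothetical helper: returns the IPs that started at least $threshold
// sessions within any span of $windowSeconds seconds.
function find_burst_ips(array $sessions, int $threshold = 10, int $windowSeconds = 60): array
{
    // Group the session start times by IP address.
    $timesByIp = [];
    foreach ($sessions as $session) {
        $timesByIp[$session['ip']][] = $session['started_at'];
    }

    $flagged = [];
    foreach ($timesByIp as $ip => $times) {
        sort($times);
        $n = count($times);
        // Slide over the sorted times: compare each start time with the
        // one ($threshold - 1) positions later.
        for ($i = 0; $i + $threshold - 1 < $n; $i++) {
            if ($times[$i + $threshold - 1] - $times[$i] <= $windowSeconds) {
                $flagged[] = $ip;
                break;
            }
        }
    }
    return $flagged;
}
```

Several real users behind one proxy can share an address, so a flagged IP is only a hint that it may be a crawler.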
d3ad1ysp0rk
Forum Donator
Posts: 1661
Joined: Mon Oct 20, 2003 8:31 pm
Location: Maine, USA

Post by d3ad1ysp0rk »

I'm guessing you have some insert code, i.e.:

Code: Select all

<?php
// escape the user agent before putting it into the query
$browser = mysql_real_escape_string($_SERVER['HTTP_USER_AGENT']);
$sql = "INSERT INTO `table` SET browser = '$browser'";
mysql_query($sql);
?>
but couldn't you just say

Code: Select all

<?php
$container = $_SERVER['HTTP_USER_AGENT'];
$string1 = "google";
$string2 = "msn";
$string3 = "whatever";
// stristr() is case-insensitive, so "google" also matches "Googlebot"
if(!stristr($container,$string1) && !stristr($container,$string2) && !stristr($container,$string3)) {
// not a known crawler, so record the visit
$browser = mysql_real_escape_string($container);
$sql = "INSERT INTO `table` SET browser = '$browser'";
mysql_query($sql);
}
?>
should work..