Possible to setup If when search engine crawler?

Posted: Mon Mar 21, 2011 4:22 pm
by samie
Hey Guys!

I've been working on some sites in PHP but am still fairly new.

I was wondering if it's possible to set up some type of IF statement that checks whether the visitor is a crawler?

For example, if I had a counter on my site that increments an entry in the database for every visitor, I don't mind counting someone who gets to the site from a search engine, but I wouldn't want a search engine crawling my site to increment the counter.

Any ideas?

This is something I'm doing from scratch (trying, anyway).

Re: Possible to setup If when search engine crawler?

Posted: Mon Mar 21, 2011 4:28 pm
by Jonah Bron
There's no sure way, but the best way would be to check the user agent ($_SERVER['HTTP_USER_AGENT']). Search it for "safari", "mozilla", "firefox", "chrome", "opera", "msie" (Internet Explorer), etc. with stripos().

http://php.net/stripos

Re: Possible to setup If when search engine crawler?

Posted: Mon Mar 21, 2011 4:29 pm
by John Cartwright
If you do a search for "PHP robot detection", you'll find quite a few libraries designed for this.

What they basically do is check the user agent in the request against a list of known robot user agents, so it's not 100% (unknown/new robots for instance), but should suffice for your purposes.

Re: Possible to setup If when search engine crawler?

Posted: Mon Mar 21, 2011 5:08 pm
by samie
Thank you! :)

Re: Possible to setup If when search engine crawler?

Posted: Mon Mar 21, 2011 5:11 pm
by samie
Ahh, I'm just about to read through those links, but regarding the code you mentioned, do you think it would be as simple as the following? Or, like you guys are saying, maybe I'll have to make it specific to each bot out there.

if (!$_SERVER['HTTP_USER_AGENT']){
//do whatever I need because you're not a crawler
}
else{
//do nothing because you ARE a crawler
}

Re: Possible to setup If when search engine crawler?

Posted: Mon Mar 21, 2011 5:19 pm
by Jonah Bron
No, more like this:

Code: Select all

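// Lowercase tokens that show up in common browser user agents.
// Rough check only: some crawlers also send "Mozilla" in their user agent.
// 'msie' is used rather than 'ie', which would also match words like "spider".
$agents = array('mozilla', 'safari', 'msie', 'firefox', 'opera', 'chrome');
// The header can be missing entirely, so fall back to an empty string.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isHuman = false;
foreach ($agents as $agent) {
    // stripos() is case-insensitive; compare with !== false because a match at offset 0 is falsy
    if (stripos($userAgent, $agent) !== false) {
        $isHuman = true;
        break;
    }
}

if ($isHuman) {
    // do whatever you need because it's not a crawler
} else {
    // do nothing because it is a crawler
}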

But I like John's idea better: use a pre-fab solution.

Re: Possible to setup If when search engine crawler?

Posted: Mon Mar 21, 2011 5:26 pm
by samie
Ahh, OK, I just found something similar to that too :P I think I understand it better now. I'm assuming the code in my last post doesn't work because every user or crawler will have an HTTP_USER_AGENT, but if it's an actual user with a web browser it will say Internet Explorer or Firefox, and if it's a crawler it will say Googlebot or something like that.

I found the following online and am going to try it out. Hopefully it works out for me. Thanks so much!

I should be able to just put:

if (!is_robot()){
//you're a user so do whatever
}
else{
//you're a robot, so do nothing
}

Code: Select all

function is_robot(){

$robots = array(
	"Accoona-AI-Agent",
	"AOLspider",
	"BlackBerry",
	"bot@bot.bot",
	"CazoodleBot",
	"CFNetwork",
	"ConveraCrawler",
	"Cynthia",
	"Dillo",
	"discoveryengine.com",
	"DoCoMo",
	"ee://aol/http",
	"exactseek.com",
	"fast.no",
	"FAST MetaWeb",
	"FavOrg",
	"FS-Web",
	"Gigabot",
	"GOFORITBOT",
	"gonzo",
	"Googlebot-Image",
	"holmes",
	"HTC_P4350",
	"HTML2JPG Blackbox",
	"http://www.uni-koblenz.de/~flocke/robot-info.txt",
	"iArchitect",
	"ia_archiver",
	"ICCrawler",
	"ichiro",
	"IEAutoDiscovery",
	"ilial",
	"IRLbot",
	"Keywen",
	"kkliihoihn nlkio",
	"larbin",
	"libcurl-agent",
	"libwww-perl",
	"Mediapartners-Google",
	"Metasearch Crawler",
	"Microsoft URL Control",
	"MJ12bot",
	"T-H-U-N-D-E-R-S-T-O-N-E",
	"voodoo-it",
	"www.aramamotorusearchengine.com",
	"archive.org_bot",
	"Teoma",
	"Ask Jeeves",
	"AvantGo",
	"Exabot-Images",
	"Exabot",
	"Google Keyword Tool",
	"Googlebot",
	"heritrix",
	"www.livedir.net",
	"iCab",
	"Interseek",
	"jobs.de",
	"MJ12bot",
	"pmoz.info",
	"SnapPreviewBot",
	"Slurp",
	"Danger hiptop",
	"MQBOT",
	"msnbot-media",
	"msnbot",
	"MSRBOT",
	"NetObjects Fusion",
	"nicebot",
	"nrsbot",
	"Ocelli",
	"Pagebull",
	"PEAR HTTP_Request class",
	"Pluggd/Nutch",
	"psbot",
	"Python-urllib",
	"Regiochannel",
	"SearchEngine",
	"Seekbot",
	"segelsuche.de",
	"Semager",
	"ShopWiki",
	"Snappy",
	"Speedy Spider",
	"sproose",
	"TurnitinBot",
	"Twiceler",
	"VB Project",
	"VisBot",
	"voyager",
	"VWBOT",
	"Wells Search",
	"West Wind",
	"Wget",
	"WWW-Mechanize",
	"www.show-tec.net",
	"xxyyzz",
	"yacybot",
	"Yahoo-MMCrawler",
	"yetibot",
);


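// No user agent header: nothing to match against, so it is not flagged here.
if (!isset($_SERVER["HTTP_USER_AGENT"])) {
	return false;
}

foreach ($robots as $robot) {
	// stripos() is a case-insensitive substring search; false means "not found"
	if (stripos($_SERVER["HTTP_USER_AGENT"], $robot) !== false) {
		return true;
	}
}

return false;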

}

Re: Possible to setup If when search engine crawler?

Posted: Mon Mar 21, 2011 5:42 pm
by Jonah Bron
If I were you, I'd go the other way. It's a lot easier to keep track of browser user agents than crawlers. Also, there are crawlers that don't provide a user agent at all. You don't even need to include obscure browsers, because it's just a counter.
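
A rough sketch of that "other way" (whitelisting browsers) might look like the following. The function name should_count_visit() and both token lists are hypothetical, picked only for illustration. It also rests on one assumption that goes beyond the post above: because some crawlers, Googlebot for example, include "Mozilla" in their user agent, obvious crawler markers are rejected before the browser tokens are checked. A missing or empty user agent is treated as a crawler.

```php
<?php
// Hypothetical helper (not from the thread): should this hit bump the counter?
function should_count_visit($userAgent)
{
    // No user agent at all: treat it as a crawler and do not count it.
    if (!is_string($userAgent) || trim($userAgent) === '') {
        return false;
    }

    // Assumption: reject obvious crawler markers first, since some crawlers
    // also send browser-like tokens such as "Mozilla".
    foreach (array('bot', 'crawl', 'spider', 'slurp') as $marker) {
        if (stripos($userAgent, $marker) !== false) {
            return false;
        }
    }

    // Short, illustrative list of mainstream browser tokens (not exhaustive).
    foreach (array('firefox', 'chrome', 'safari', 'opera', 'msie') as $browser) {
        if (stripos($userAgent, $browser) !== false) {
            return true;
        }
    }

    // Unknown agents are not counted; fine for a simple visitor counter.
    return false;
}

// Usage: read the header defensively, then gate the counter on the result.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : null;
if (should_count_visit($ua)) {
    // increment the counter in the database here
}
```

Passing the user agent in as an argument, instead of reading $_SERVER inside the function, makes it easy to try the check against sample strings before wiring it into the counter.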