Prevent Nasty Crawlers from Crawling


fresh
Forum Contributor
Posts: 259
Joined: Mon Jun 14, 2004 10:39 am
Location: Amerika

Post by fresh »

Code:

<?php
// Reverse-resolve the visitor's IP. gethostbyaddr() returns the IP unchanged
// when there is no reverse DNS record, so such visitors never match.
$hostname = gethostbyaddr($_SERVER['REMOTE_ADDR']);
$ccode = ".inktomisearch.com";
// Compare the end of the hostname so hosts with extra labels also match.
$chck = strtolower(substr((string) $hostname, -strlen($ccode)));
if ($chck === $ccode) {
    // spoof 404: send the status line too, not just a page that says 404
    header("HTTP/1.0 404 Not Found");
    echo "
<html><head>
    <title>404 - Error</title>
<style>
body { font-family: verdana, arial, sans-serif; font-size: 12pt; color: #333;
       background-color: #fff; margin: 0pt; padding: 0pt; }
</style>
</head>
<body>
<div align='center'>
<p><img src='/logo.gif'
       width='200' height='118'
       border='0'
       alt='logo'></p>
<p>&nbsp;</p>
<table width='580'>
<tr><td>
    <h3>404 not found</h3>
    <p>The requested resource could not be found.</p>

</td></tr></table>
</div>
</body>
</html>";
} else {
    // let them in
    // code here
}
?>
This is an example of how to prevent crawlers such as inktomisearch from crawling your pages, specifically the ones you use to track downloads, reviews, etc. Banning may be better, but this coupled with a ban may be the double layer you need to kill these nasty beasts.

Meant to be a working example of banning by way of a spoofed 404. You could also use this snippet to search hostnames for country codes. Let's say you wish to keep people from .au (Australia) from viewing your web pages: you could search for .au and they would be shown the 404. However, if they bounce off proxies outside Australia then of course neither this nor banning will work, but you know, we try.
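
A minimal sketch of the country-code idea, assuming a hypothetical helper named host_has_suffix() that is not part of the snippet above. It compares the end of the resolved hostname against a list of suffixes; the list itself is only an example.

```php
<?php
// Hypothetical helper (not from the original snippet): returns true when
// $hostname ends with any of the given suffixes, compared case-insensitively.
function host_has_suffix($hostname, array $suffixes)
{
    $hostname = strtolower($hostname);
    foreach ($suffixes as $suffix) {
        $suffix = strtolower($suffix);
        if ($suffix !== '' && substr($hostname, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Example list; the entries are illustrative only.
$blocked = array('.inktomisearch.com', '.au');

// In a web request the hostname would come from
// gethostbyaddr($_SERVER['REMOTE_ADDR']). That call returns the bare IP
// when no reverse DNS record exists, so such clients never match a suffix.
```

Matching on the end of the name matters here: a host like host.example.com.au ends in .au, but searching from the first dot would yield .example.com.au and miss it.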

BTW: would this be considered a honeypot, or at least a form of one?

regards,
- fresh
Last edited by fresh on Sat Jan 15, 2005 10:25 pm, edited 1 time in total.
fresh
Forum Contributor
Posts: 259
Joined: Mon Jun 14, 2004 10:39 am
Location: Amerika

Post by fresh »

hey, what do you guys think of this.. is it just completely useless or what?
kettle_drum
DevNet Resident
Posts: 1150
Joined: Sun Jul 20, 2003 9:25 pm
Location: West Yorkshire, England

Post by kettle_drum »

Sure it works, but I don't see why you want to stop crawlers. Yes, there may be certain sections of your site that you don't want to be crawled, but you can specify them in meta tags or the robots.txt file.

The code isn't just for bots/crawlers though, and you might have got more of a response if you had generalized the script a bit. As you say, if you change it to .au then it will stop all Australians from visiting the site. You could take this further and ban user-agents and such, which would be another good way to ban bots.
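
A rough sketch of the user-agent idea, assuming a hypothetical helper named is_blocked_agent(). "Slurp" is the name Inktomi's crawler announced itself with; treat the token list as an example, and remember the header is supplied by the client and can be changed.

```php
<?php
// Hypothetical helper: returns true when the User-Agent string contains
// any of the blocked tokens (case-insensitive substring match).
function is_blocked_agent($userAgent, array $tokens)
{
    foreach ($tokens as $token) {
        if ($token !== '' && stripos($userAgent, $token) !== false) {
            return true;
        }
    }
    return false;
}

// Example token list; illustrative only.
$blockedAgents = array('Slurp');

// In a web request it could be used like this:
// $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
// if (is_blocked_agent($ua, $blockedAgents)) {
//     header('HTTP/1.0 403 Forbidden');
//     exit;
// }
```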
fresh
Forum Contributor
Posts: 259
Joined: Mon Jun 14, 2004 10:39 am
Location: Amerika

Post by fresh »

hey, that's a good idea, but I read about inktomisearch and it said it utilises robots.txt to do its searching, so this one is a bit nasty.. plus I read others saying how they would like to stop it from crawling, because it floods them with crawls, so I figured it would be useful to someone.. that was a good idea you came up with about the user-agents, but those can be changed easily enough..

This mission is quite a pain in the arse if you ask me. :)
timvw
DevNet Master
Posts: 4897
Joined: Mon Jan 19, 2004 11:11 pm
Location: Leuven, Belgium

Post by timvw »

I think you'd better have a look at Apache's mod_access.

Then have a look at your log files, find out which hosts you don't want crawling your site, and deny those ;)
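
For the log-file step, a small sketch in PHP, assuming a hypothetical helper named count_hits_by_host() and assuming the client address or hostname is the first whitespace-separated field on each log line (as in the common log format). It only finds the busiest clients; the actual deny rules would still live in the Apache configuration.

```php
<?php
// Hypothetical helper: counts requests per client from access-log lines.
// Assumes the client address or hostname is the first field on each line.
function count_hits_by_host(array $lines)
{
    $hits = array();
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }
        $parts = preg_split('/\s+/', $line);
        $host = $parts[0];
        $hits[$host] = isset($hits[$host]) ? $hits[$host] + 1 : 1;
    }
    arsort($hits); // busiest clients first
    return $hits;
}

// Usage with a real log would look like this (the path is a placeholder):
// $hits = count_hits_by_host(file('/path/to/access.log'));
```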
fresh
Forum Contributor
Posts: 259
Joined: Mon Jun 14, 2004 10:39 am
Location: Amerika

Post by fresh »

hey, even better, this is a certain ban for sure: code up some Java, query the machine for the IP, and ban that. Proxies will not help this machine hide.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

just make sure it's a valid (internet routable) IP then.. ;)
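
One way to sketch that check in PHP, assuming a hypothetical helper named is_public_ip(). It relies on filter_var() with the FILTER_FLAG_NO_PRIV_RANGE and FILTER_FLAG_NO_RES_RANGE flags, which reject private ranges such as 10.0.0.0/8 and 192.168.0.0/16 and reserved ranges such as 127.0.0.0/8.

```php
<?php
// Hypothetical helper: returns true only for syntactically valid addresses
// that fall outside the private and reserved ranges.
function is_public_ip($ip)
{
    return filter_var(
        $ip,
        FILTER_VALIDATE_IP,
        FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
    ) !== false;
}
```

An address that fails this test (for example one reported from behind NAT) is not worth banning, since it does not identify a machine on the public internet.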
fresh
Forum Contributor
Posts: 259
Joined: Mon Jun 14, 2004 10:39 am
Location: Amerika

Post by fresh »

you're right, if they are behind a firewall or NAT it will probably show something different.. I'll have to play with it a bit. If I come up with something concrete, would you guys care if I posted my source here? It will most likely be in Java though.. ;)
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

no problem here. :)
fresh
Forum Contributor
Posts: 259
Joined: Mon Jun 14, 2004 10:39 am
Location: Amerika

Post by fresh »

cool.. thanks Feyd! :)

quick question.. First of all, I figure this is what I will do:

I will write a Java socket to listen on port 80, and once someone or some bot requests the page, whether via browser or telnet, I will snatch their IP that way. Otherwise, I could use JSP to strip it from the headers; I already wrote a script to do that along with the SID, hostname, etc., but as you know that is a completely useless way of tracking clients, so we use Java, which runs client side and can create sockets.

Q: If I try to listen on port 80 will that cause a conflict with the HTTP server which is also listening on port 80?

If so, what could be an alternative approach? Maybe I could couple the routine with PHP, which could send them to, let's say, port 1337 and then back to port 80 before they even know what happened, and I could throw in a routine that bans predefined IPs as well?

btw: if anyone would like to see my JSP script just ask; also, since no one has objected I will make sure to publish my Java source here with a link to download the class file to use however you want. ;)

regards
n00b Saibot
DevNet Resident
Posts: 1452
Joined: Fri Dec 24, 2004 2:59 am
Location: Lucknow, UP, India

Post by n00b Saibot »

I think there will be a conflict if you try to run two processes listening on the same port.
Say, if you can code in Java, then maybe you can write a program to listen on port 80, check whether the incoming request is valid, and then redirect it to the port the server listens on, or something like that.
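
The conflict can be seen with a short PHP sketch (an illustration in PHP rather than Java, using a loopback address and an OS-chosen port instead of port 80). On the usual platforms the second attempt to listen on an address and port that already has a listener is refused.

```php
<?php
// First listener: port 0 asks the OS for any free port on the loopback interface.
$first = stream_socket_server('tcp://127.0.0.1:0', $errno, $errstr);

// Ask which address and port were actually assigned, e.g. "127.0.0.1:<port>".
$address = stream_socket_get_name($first, false);

// Second listener on the very same address and port. The @ suppresses the
// warning PHP raises when the bind is refused; the call then returns false.
$second = @stream_socket_server('tcp://' . $address, $errno2, $errstr2);

// True when the second bind was refused.
$conflict = ($second === false);
```

This is why the thread moves on to a second port: only one listener at a time normally gets port 80.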
fresh
Forum Contributor
Posts: 259
Joined: Mon Jun 14, 2004 10:39 am
Location: Amerika

Post by fresh »

I figured from my experience with coding in C++ and VB that listening on the same port is foul, so I just assumed that it would cause conflicts. I have already written the server in Java. I think I will make it listen on port 1337, couple it with PHP, and have the script send them to that port like:

header("Location: http://blah.com:1337");

then I will run the Java against the client, retrieve the IP and send them back to port 80 before they even know what happened; it will look like a meta redirect at worst.

I still haven't finished coding everything on the server. Right now it accepts connections, retrieves remote input and echoes it back to the client, which I was using for testing purposes; I plan to remove that before the release. Maybe by the end of the week I will have something complete.
timvw
DevNet Master
Posts: 4897
Joined: Mon Jan 19, 2004 11:11 pm
Location: Leuven, Belgium

Post by timvw »

n00b Saibot
DevNet Resident
Posts: 1452
Joined: Fri Dec 24, 2004 2:59 am
Location: Lucknow, UP, India

Post by n00b Saibot »

Hey! I hadn't advised you to write a full-fledged server or something. I only said maybe write a loop or so which would accept all connections at 80 and, if it's a valid one, redirect it to the real port (1337 according to you).
You don't have to put your whole week into that simple thing.

:P:P:P
fresh
Forum Contributor
Posts: 259
Joined: Mon Jun 14, 2004 10:39 am
Location: Amerika

Post by fresh »

well, validation will come after I have retrieved the true IP, even if the client is bouncing off proxies, which is the point of this project.

I assume since the HTTP server is already listening on port 80, then I would need to listen on a different port such as 1337.

However flawed, my theory is this:

1. Listen on port 1337
2. User connects to port 80
3. PHP sends them to port 1337
4. Server accepts connection
5. Server queries the machine for IP
6. Server retrieves the IP and logs it
7. Server sends them back to port 80
8. done
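
The steps above can be sketched in a few lines of PHP (an illustration only, not the Java server described in this thread). The sketch assumes a loopback address and an OS-chosen port in place of 1337, plays the redirected client itself, and reuses the blah.com placeholder. One caveat it makes visible: the address a listener can read is the peer address of the TCP connection, so a client arriving through a proxy would show the proxy's address here just as it does on port 80.

```php
<?php
// Step 1: listen. Port 0 picks a free port for the demo; the thread uses 1337.
$server = stream_socket_server('tcp://127.0.0.1:0', $errno, $errstr);
$listenAddress = stream_socket_get_name($server, false);

// Steps 2-3 stand-in: a client that has been redirected to the second port.
$client = stream_socket_client('tcp://' . $listenAddress, $cerrno, $cerrstr, 5);

// Step 4: accept the connection (wait up to 5 seconds).
$conn = stream_socket_accept($server, 5);

// Steps 5-6: read the peer address of this TCP connection and log it.
$peer = stream_socket_get_name($conn, true);
$log = array($peer);

// Step 7: send the client back to port 80 with an HTTP redirect.
fwrite($conn, "HTTP/1.0 302 Found\r\nLocation: http://blah.com/\r\n\r\n");
fclose($conn);

// The client side reads the redirect it was sent.
$response = stream_get_contents($client);
fclose($client);
fclose($server);
```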

I haven't even begun testing this on any HTTP server yet; so far I have only gotten it to run on my PC via the command line: java file

And it works as expected. I still need to write the query chunk and the redirection chunk, and then it will be complete. The code is quite small, and I would have gotten this done sooner except I had never written anything in Java before, so it took some time to learn the language.

The time spent is fine because I can always recycle the code and use it to make chat applets or something, and it was fun to learn, so I would say that the time spent was and will continue to be well worth it to me.

Although I do want to ask anyone who may know from experience whether my theory is indeed flawed and, if so, what I can do as an alternative to achieve the same results.

regards