Spidering a reciprocal link site for a link
Posted: Mon Jan 24, 2005 6:16 am
by mjseaden
Hi there,
I need some simple code to spider a site, given the site's URL, for a link to my page, so that over the three weeks after requesting a link I can check automatically whether the webmaster has added it to one of their pages.
I need it to spider the whole site because it is not always clear which page the link may be added to.
Any help would be greatly appreciated - it just looks like a simple regex() and fopen() script.
Many thanks
Mark
Posted: Mon Jan 24, 2005 8:16 am
by feyd
the source browser I wrote here, may be of interest.
viewtopic.php?t=29312
Script checker
Posted: Thu Jan 27, 2005 1:25 pm
by mjseaden
Hi,
I appreciate the response, but I'm afraid the script's beyond me for the moment.
All I need is a script that, given a URL, will return all the links (<A HREF>s) on the page, likely using fopen() and regex(). I can do the rest.
In fact, can anyone give me a regex() that will return the URL in an <A HREF=""> tag?
Cheers
Mark
Posted: Thu Jan 27, 2005 1:27 pm
by feyd
the regex needed is in that thread. It handles src, href, and action from what I remember. it does not handle Javascript ones..
Posted: Thu Jan 27, 2005 1:37 pm
by mjseaden
Hi feyd,
I think the one you're talking about is of the form #://#, is that right? However, how do I get it to return the URLs in the opened page?
Sorry for being crap.
Cheers
Mark
Posted: Thu Jan 27, 2005 1:41 pm
by feyd
Code: Select all
$urls = array( 'href', 'src', 'action', 'background' ); // resolve these attributes from the text
$urls = implode( '|', $urls );
preg_match_all( '#\s+?(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $data, $matches );
print_r($matches);
is the url finding code.
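To see what ends up in $matches, here is a minimal sketch that runs the same pattern against a made-up HTML snippet. The sample markup and the $attrs and $links names are assumptions for illustration only. Group 1 holds the attribute name, group 2 the quote character, and group 3 the URL itself.

```php
<?php
// Hypothetical sample page; in real use $data would hold the fetched HTML.
$data = '<a href="http://example.com/links.html">Links</a> <img src=\'logo.png\'>';

// Build the attribute alternation: href|src|action|background
$attrs = implode('|', array('href', 'src', 'action', 'background'));
preg_match_all('#\s+?(' . $attrs . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $data, $matches);

// $matches[3] is the third capture group: the attribute values (the URLs).
$links = $matches[3];
print_r($links);
```

So the URLs found on the page are simply the entries of $matches[3], in the order they appear.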
Posted: Thu Jan 27, 2005 1:54 pm
by mjseaden
Hi feyd
Thanks, but I'm getting these errors with the following script:
Code: Select all
Warning: fopen("http://www.???.biz/index.php", "r") - Success in /home/XXX/quickcheck.php on line 5
Warning: stat failed for http://www.???.biz/index.php (errno=2 - No such file or directory) in /home/XXX/quickcheck.php on line 6
Warning: Supplied argument is not a valid File-Handle resource in /home/XXX/quickcheck.php on line 6
Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )
Warning: Supplied argument is not a valid File-Handle resource in /home/XXX/quickcheck.php on line 14
Code: Select all
<?php
// URL finding code
// get contents of a file into a string
$filename = $_GET['url'];
$handle = fopen($filename, "r");
$contents = fread($handle, filesize($filename));
// Retrieve all URLs from the HTML
$urls = array( 'href', 'src', 'action', 'background' ); // resolve these attributes from the text
$urls = implode( '|', $urls );
preg_match_all( '#\s+?(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $content, $matches );
print_r($matches);
fclose($handle);
?>
Any idea what's wrong with this? It seems to be warning me that the file open was successful!?
Cheers
Mark
Posted: Thu Jan 27, 2005 1:55 pm
by feyd
you'll need to use a defined size of read, as filesize() is unable to determine the correct size.
I'd suggest file_get_contents() instead.
Posted: Thu Jan 27, 2005 1:59 pm
by mjseaden
Hi feyd,
It looks like my version of PHP 4 is slightly out of date and it's claiming it's an undefined function. Is there an equivalent way of doing it?
Cheers
Mark
Posted: Thu Jan 27, 2005 2:03 pm
by feyd

wow.. old version.. uh
Code: Select all
function fileContents($file)
{
    $fp = @fopen($url, 'rb');
    if(!$fp) return '';
    $contents = '';
    while(!feof($fp))
        $contents .= fread($fp, 1024);
    fclose($fp);
    return $contents;
}
Posted: Thu Jan 27, 2005 2:07 pm
by mjseaden
Thanks feyd,
Code: Select all
Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )
It doesn't seem to be returning any contents - for example in this case I used
http://www.google.com/index.html.
Cheers
Mark
Posted: Thu Jan 27, 2005 2:08 pm
by mjseaden
Got it - $url should be $file in the function.
Cheers
Mark
Posted: Thu Jan 27, 2005 2:10 pm
by feyd

sorry.

Posted: Thu Jan 27, 2005 2:12 pm
by mjseaden
Feyd,
Using the following script
Code: Select all
<?php
// URL finding code
// get contents of a file into a string
function fileContents($file)
{
    $fp = @fopen($file, 'rb');
    if(!$fp) return '';
    $contents = '';
    while(!feof($fp))
        $contents .= fread($fp, 1024);
    fclose($fp);
    return $contents;
}
$filename = $_GET['url'];
$contents = fileContents( $filename );
// Retrieve all URLs from the HTML
$urls = array( 'href', 'src', 'action', 'background' ); // resolve these attributes from the text
$urls = implode( '|', $urls );
preg_match_all( '#\s+?(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $content, $matches );
print_r($matches);
?>
I'm getting the following output for ?url=
http://www.google.com/index.html:
Code: Select all
Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )
Any ideas? It returns the same for other URLs.
Cheers
Mark
Posted: Thu Jan 27, 2005 2:16 pm
by mjseaden
$content should be $contents in preg_match_all()...
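For the original goal of checking whether a page links back to mine, something along these lines might do as a starting point. It is only a rough sketch: hasLinkTo() is a hypothetical helper name, the example.org URLs are placeholders, and ignoring a trailing slash when comparing is an assumption about what should count as "the same link". In real use $page would come from fileContents() rather than a hard-coded string.

```php
<?php
// Hypothetical helper: returns true if $html contains an href pointing at $target.
function hasLinkTo($html, $target)
{
    // Same idea as the pattern in the thread, restricted to href attributes.
    // Group 1 is the quote character, group 2 is the URL.
    preg_match_all('#\s+?href\s*?=\s*?([\'"]?)(.*?)\\1[\s\>]#is', $html, $matches);
    foreach ($matches[2] as $url) {
        // Assumption: treat URLs that differ only by a trailing slash as equal.
        if (rtrim($url, '/') === rtrim($target, '/')) {
            return true;
        }
    }
    return false;
}

// Placeholder markup standing in for a fetched page.
$page = '<p><a href="http://www.example.org/">My site</a></p>';
var_dump(hasLinkTo($page, 'http://www.example.org'));
var_dump(hasLinkTo($page, 'http://www.example.net'));
```

Spidering the whole site would then mean calling this on each page and queueing the same-site URLs it finds, which is left out here.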