Spidering a reciprocal link site for a link


mjseaden
Forum Contributor
Posts: 458
Joined: Wed Mar 17, 2004 5:49 am

Spidering a reciprocal link site for a link

Post by mjseaden »

Hi there,

I need some simple code to spider a site, given its URL, looking for a link to my page. After requesting a link, I want to check automatically over a 3-week period whether the webmaster has added it to one of their pages.

I need it to spider the whole site because it is not always clear which page the link will be added to.

Any help would be greatly appreciated - it looks like it just needs a simple regex and fopen() script.

Many thanks

Mark
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

The source browser I wrote here may be of interest.

viewtopic.php?t=29312

Script checker

Post by mjseaden »

Hi,

I appreciate the response, but I'm afraid the script's beyond me for the moment.

All I need is a script that, given a URL, will return all the links (<A HREF> tags) on the page, likely using fopen() and a regex. I can do the rest.

In fact, can anyone give me a regex that will return the URL in an <A HREF=""> tag?

Cheers

Mark

Post by feyd »

The regex needed is in that thread. It handles src, href, and action from what I remember. It does not handle JavaScript ones.

Post by mjseaden »

Hi feyd,

I think the one you're talking about is of the form #://#, is that right? However, how do I get it to return the URLs in the opened page?

Sorry for being crap.

Cheers

Mark

Post by feyd »

Code: Select all

$urls = array( 'href', 'src', 'action', 'background' );   //   resolve these attributes from the text

$urls = implode( '|', $urls );
preg_match_all( '#\s+?(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $data, $matches );

print_r($matches);

is the url finding code.
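
For anyone following along, here is a minimal sketch of how the URLs come back from that call. The HTML in $data is made up for illustration, and the name $found is my own. With preg_match_all() in its default mode, $matches[1] holds the attribute names, $matches[2] the quote characters, and $matches[3] the URLs themselves.

```php
<?php
// Made-up HTML standing in for a fetched page (an assumption for this demo).
$data = '<a href="http://example.com/a.html">A</a> <img src=\'/img/logo.png\'> <a href=plain.html >B</a>';

// Same attribute list and pattern as in the post above.
$attrs = implode('|', array('href', 'src', 'action', 'background'));
preg_match_all('#\s+?(' . $attrs . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $data, $matches);

// Group 3 is the value between the (optional) quotes, i.e. the URL.
$found = $matches[3];
print_r($found);
```

Note that the pattern expects whitespace or a closing > right after the value, so an unquoted attribute is only picked up when one of those follows it.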

Post by mjseaden »

Hi feyd

Thanks, but I'm getting these errors with the following script:

Code: Select all

Warning: fopen("http://www.???.biz/index.php", "r") - Success in /home/XXX/quickcheck.php on line 5
Warning: stat failed for http://www.???.biz/index.php (errno=2 - No such file or directory) in /home/XXX/quickcheck.php on line 6
Warning: Supplied argument is not a valid File-Handle resource in /home/XXX/quickcheck.php on line 6
Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )
Warning: Supplied argument is not a valid File-Handle resource in /home/XXX/quickcheck.php on line 14

Code: Select all

<?php
// URL finding code
// get contents of a file into a string
$filename = $_GET['url'];
$handle = fopen($filename, "r");
$contents = fread($handle, filesize($filename));

// Retrieve all URLs from the HTML
$urls = array( 'href', 'src', 'action', 'background' );   //   resolve these attributes from the text    
$urls = implode( '|', $urls ); 
preg_match_all( '#\s+?(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $content, $matches );
print_r($matches);

fclose($handle);
?>
Any idea what's wrong with this? It seems to be warning me that the file open was successful!?

Cheers

Mark

Post by feyd »

You'll need to read in fixed-size chunks, as filesize() cannot determine the size of a remote URL (it has to stat a local file, which is where the "stat failed" warning comes from).

I'd suggest file_get_contents() instead.
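
A minimal sketch of the file_get_contents() route, assuming PHP 4.3.0 or later (where the function was introduced) and, for remote pages, that allow_url_fopen is enabled. fetchPage() is a hypothetical helper name, not something from this thread, and the demo uses file_put_contents(), which needs PHP 5.

```php
<?php
// Hypothetical helper: return the body at $url, or '' if it cannot be read.
function fetchPage($url)
{
    // file_get_contents() reads the whole stream, so no filesize() call is needed.
    // The @ hides the warning on failure; the function returns false in that case.
    $body = @file_get_contents($url);
    return ($body === false) ? '' : $body;
}

// Offline demo: a local temp file stands in for a remote page.
$tmp = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($tmp, '<a href="http://example.com/">x</a>');
echo strlen(fetchPage($tmp)), " bytes read\n";
unlink($tmp);
```

For a real page you would pass the URL (for example the value of $_GET['url']) instead of the temp file path.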

Post by mjseaden »

Hi feyd,

It looks like my version of PHP 4 is slightly out of date and it's claiming it's an undefined function. Is there an equivalent way of doing it?

Cheers

Mark

Post by feyd »

8O wow.. old version.. uh

Code: Select all

function fileContents($file)
{
  $fp = @fopen($url, 'rb');
  if(!$fp) return '';

  $contents = '';
  while(!feof($fp))
    $contents .= fread($fp, 1024);

  fclose($fp);

  return $contents;
}

Post by mjseaden »

Thanks feyd,

Code: Select all

Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )
It doesn't seem to be returning any contents - for example in this case I used http://www.google.com/index.html.

Cheers

Mark

Post by mjseaden »

Got it - $url should be $file in the function.

Cheers

Mark

Post by feyd »

:oops: sorry. :)

Post by mjseaden »

Feyd,

Using the following script

Code: Select all

<?php
// URL finding code
// get contents of a file into a string
function fileContents($file)
{
  $fp = @fopen($file, 'rb');
  if(!$fp) return '';

  $contents = '';
  while(!feof($fp))
    $contents .= fread($fp, 1024);

  fclose($fp);

  return $contents;
}

$filename = $_GET['url'];
$contents = fileContents( $filename );

// Retrieve all URLs from the HTML
$urls = array( 'href', 'src', 'action', 'background' );   //   resolve these attributes from the text
$urls = implode( '|', $urls );
preg_match_all( '#\s+?(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $content, $matches );
print_r($matches);
?>
I'm getting the following output for ?url=http://www.google.com/index.html:

Code: Select all

Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )
Any ideas? It returns the same for other URLs.

Cheers
Mark

Post by mjseaden »

$content should be $contents in preg_match_all()...
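
Putting the pieces from this thread together, here is a rough sketch of the original goal: walking a site and reporting the first page that links back to you. It assumes PHP 5.3 or later (for the closure). extractLinks(), findBacklink(), the $fetch callback and every URL in the demo are hypothetical names chosen for illustration. It only follows absolute links on the same host and does not resolve relative ones, so treat it as a starting point rather than a complete spider.

```php
<?php
// Pull href values out of HTML using the pattern from this thread (href only).
function extractLinks($html)
{
    preg_match_all('#\s+?href\s*?=\s*?([\'"]?)(.*?)\\1[\s\>]#is', $html, $m);
    return $m[2];
}

// Hypothetical crawler. $fetch is any callable mapping a URL to its HTML ('' on failure).
// Returns the URL of the first page that links to $target, or null if none is found.
function findBacklink($startUrl, $target, $fetch, $maxPages = 50)
{
    $host  = parse_url($startUrl, PHP_URL_HOST);
    $queue = array($startUrl);
    $seen  = array($startUrl => true);

    while ($queue && $maxPages-- > 0) {
        $page = array_shift($queue);
        foreach (extractLinks($fetch($page)) as $link) {
            if (stripos($link, $target) === 0) {
                return $page; // this page links to the target
            }
            // Follow only absolute links on the same host, each at most once.
            if (parse_url($link, PHP_URL_HOST) === $host && !isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return null;
}

// Offline demo with a fake two-page site (all URLs are made up).
$site = array(
    'http://partner.example/'           => '<a href="http://partner.example/links.html">Links</a>',
    'http://partner.example/links.html' => '<a href="http://mysite.example/">My site</a>',
);
$fetch = function ($url) use ($site) {
    return isset($site[$url]) ? $site[$url] : '';
};

$hit = findBacklink('http://partner.example/', 'http://mysite.example', $fetch);
echo $hit === null ? "No link found\n" : "Link found on $hit\n";
```

To run it against a live site, swap the fake $fetch for one that calls fileContents() from earlier in the thread. The $maxPages cap is there so a large site cannot keep the script running indefinitely.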