Spidering a reciprocal link site for a link

mjseaden
Forum Contributor
Posts: 458
Joined: Wed Mar 17, 2004 5:49 am

Post by mjseaden »

feyd,

Really sorry about this, but it still doesn't seem to be working.

I'm still getting empty returns when trying for http://www.google.co.uk/index.html.

The code is

Code: Select all

<?php
// URL finding code
// get contents of a file into a string
function fileContents($file)
{
  $fp = @fopen($file, 'rb');
  if(!$fp) return '';

  $contents = '';
  while(feof($fp) !== false)
    $contents .= fread($fp, 1024);

  fclose($fp);

  return $contents;
}

$filename = $_GET['url'];
$contents = fileContents( $filename );

// Retrieve all URLs from the HTML
$urls = array( 'href', 'src', 'action', 'background' );   //   resolve these attributes from the text
$urls = implode( '|', $urls );
preg_match_all( '#\s+?(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $contents, $matches );
print_r($matches);
?>
and the output is

Code: Select all

Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )
When I echo $contents, it's not returning any data, so it appears the fileContents function isn't working. I know http://www.google.co.uk/index.html exists.

Any idea what's going on?

Cheers

Mark

Post by mjseaden »

Feyd!

I've fixed the file opening code using file().

I get the following output:

Code: Select all

Array ( [0] => Array ( [0] => href="/imghp?hl=en&tab=wi&ie=UTF-8"> [1] => href="/grphp?hl=en&tab=wg&ie=UTF-8"> [2] => href="/nwshp?hl=en&tab=wn&ie=UTF-8"> [3] => href="/options/index.html" [4] => href=/advanced_search?hl=en> [5] => href=/preferences?hl=en> [6] => href=/language_tools?hl=en> [7] => href="/ads/"> [8] => href=/services/> [9] => href=/intl/en/about.html> [10] => href=http://www.google.co.uk/jobs/> [11] => href=http://www.google.com/ncr> ) [1] => Array ( [0] => href [1] => href [2] => href [3] => href [4] => href [5] => href [6] => href [7] => href [8] => href [9] => href [10] => href [11] => href ) [2] => Array ( [0] => " [1] => " [2] => " [3] => " [4] => [5] => [6] => [7] => " [8] => [9] => [10] => [11] => ) [3] => Array ( [0] => /imghp?hl=en&tab=wi&ie=UTF-8 [1] => /grphp?hl=en&tab=wg&ie=UTF-8 [2] => /nwshp?hl=en&tab=wn&ie=UTF-8 [3] => /options/index.html [4] => /advanced_search?hl=en [5] => /preferences?hl=en [6] => /language_tools?hl=en [7] => /ads/ [8] => /services/ [9] => /intl/en/about.html [10] => http://www.google.co.uk/jobs/ [11] => http://www.google.com/ncr ) )
With the following code:

Code: Select all

<?php
$filename = $_GET['url'];
$contents = implode('', file($filename));

// Retrieve all URLs from the HTML
$urls = array( 'href' );   //   resolve these attributes from the text
$urls = implode( '|', $urls );
preg_match_all( '#\s+?(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $contents, $matches );

print_r($matches);
?>
This looks correct! However, it's a two-dimensional array: some of the sub-arrays just store 'href', and others only the quote character (").

Is there any way to get a straight one-dimensional array with just the HREF="<contents>" <contents> stored in each element?

I'd really appreciate your help on this, as I'll be able to continue with my project.

Cheers

Mark
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

There were several problems, which I went through and fixed. I haven't tested it on much, though.

Code: Select all

<?php

// URL finding code
// get contents of a file into a string
function fileContents($file)
{
  $fp = fopen($file, 'rb');

  if(!$fp) return '';

  $contents = '';
  while(!feof($fp))
  {
    $contents .= fread($fp, 1024);
  }

  fclose($fp);

  return $contents;
}

//$filename = $_SERVER['argv'][1];
$filename = $_GET['url'];
$contents = fileContents( $filename );

var_export($contents);

// Retrieve all URLs from the HTML
$urls = array( 'href', 'src', 'action', 'background' );   //   resolve these attributes from the text
$urls = implode( '|', $urls );
preg_match_all( '#(?<![a-z0-9])(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s>]#is', $contents, $matches );
print_r($matches);

?>
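
The key change in the read loop is the condition. feof($fp) returns false until the end of the stream is reached, so the earlier test feof($fp) !== false is already false on the first check and the loop body never runs, which would explain the empty $contents. A minimal sketch of the corrected loop, using an in-memory stream as a stand-in for the remote page; the helper name readAll and the sample data are made up for illustration, and php://memory assumes PHP 5.1 or later:

```php
<?php
// Hypothetical helper: read an already-open stream to the end.
function readAll($fp)
{
    $contents = '';
    while (!feof($fp)) {              // keep reading until end-of-file
        $contents .= fread($fp, 1024);
    }
    return $contents;
}

// In-memory stream standing in for the remote page (illustrative data only).
$fp = fopen('php://memory', 'w+b');
fwrite($fp, str_repeat('x', 3000));
rewind($fp);

// The earlier condition is false straight away, so that loop never starts.
$oldConditionHolds = (feof($fp) !== false);

$data = readAll($fp);
fclose($fp);

var_dump($oldConditionHolds);   // bool(false)
echo strlen($data), "\n";       // 3000
```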
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

you know there is an edit button...

no need for 4 posts in a row.

Post by feyd »

getting a unidimensional version is just $matches[0]

Post by mjseaden »

function 'var_export' is not recognised. Hmm.

Post by feyd »

just comment that line out.. it was for debugging.

Post by mjseaden »

Hi feyd

When I

Code: Select all

echo $matches[0];
I don't get any output apart from the word 'Array'.

Is there any way to get it in serial elements, with the URL in each element?

I hope that makes sense - then this whole issue is resolved!

Many thanks

Mark

Post by feyd »

$matches[0] is an array... print_r($matches[0]);

silly monkey..

8)
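
For anyone else hitting this: echo expects a string, so handing it an array only prints the word 'Array'. A small sketch of the usual ways to show the elements instead; the $links array is made-up sample data, not real output from the script:

```php
<?php
// Made-up sample data standing in for one of the $matches sub-arrays.
$links = array('/ads/', '/services/', '/intl/en/about.html');

// Dump the whole array with its keys (handy for debugging).
print_r($links);

// Or walk it and print one element per line.
foreach ($links as $i => $link) {
    echo $i, ': ', $link, "\n";
}

// Or join the elements into a single string first.
$joined = implode(', ', $links);
echo $joined, "\n";
```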

Post by mjseaden »

Got it! Needed print_r($matches[3])

Thanks a lot Feyd, I appreciate it!
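
To spell out why index 3 is the one wanted: with the default ordering, preg_match_all() puts the full matches in $matches[0] and the three parenthesised groups in $matches[1] (attribute name), $matches[2] (quote character, possibly empty) and $matches[3] (the value). A rough sketch on a made-up HTML snippet, reusing the pattern from feyd's post; the $html string and the $attrs and $found names are assumptions for illustration:

```php
<?php
// Made-up HTML snippet; quoted and unquoted attribute values on purpose.
$html = '<a href="/ads/">Ads</a> <a href=/services/>Services</a> <img src="logo.gif">';

$attrs = implode('|', array('href', 'src', 'action', 'background'));
preg_match_all(
    '#(?<![a-z0-9])(' . $attrs . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s>]#is',
    $html,
    $matches
);

// $matches[0]: full matches, [1]: attribute names, [2]: quotes, [3]: values.
$found = $matches[3];   // already a plain one-dimensional array

foreach ($found as $url) {
    echo $url, "\n";
}
```

Here $found holds /ads/, /services/ and logo.gif, one per element.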