Posted: Thu Jan 27, 2005 2:21 pm
by mjseaden
feyd,

Really sorry about this, but it still doesn't seem to be working.

I'm still getting empty results when I try http://www.google.co.uk/index.html.

The code is

Code: Select all

<?php
// URL finding code
// get contents of a file into a string
function fileContents($file) 
{
  $fp = @fopen($file, 'rb'); 
  if(!$fp) return ''; 

  $contents = ''; 
  while(feof($fp) !== false) 
    $contents .= fread($fp, 1024); 

  fclose($fp); 

  return $contents; 
}

$filename = $_GET['url'];
$contents = fileContents( $filename );

// Retrieve all URLs from the HTML
$urls = array( 'href', 'src', 'action', 'background' );   //   resolve these attributes from the text    
$urls = implode( '|', $urls ); 
preg_match_all( '#\s+?(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $contents, $matches );
print_r($matches);
?>
and the output is

Code: Select all

Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )
When I echo $contents, it's not returning any data, so it appears the fileContents function isn't working. I know http://www.google.co.uk/index.html exists.

Any idea what's going on?

Cheers

Mark

Posted: Thu Jan 27, 2005 2:38 pm
by mjseaden
Feyd!

I've fixed the file opening code using file().

I get the following output:

Code: Select all

Array ( [0] => Array ( [0] => href="/imghp?hl=en&tab=wi&ie=UTF-8"> [1] => href="/grphp?hl=en&tab=wg&ie=UTF-8"> [2] => href="/nwshp?hl=en&tab=wn&ie=UTF-8"> [3] => href="/options/index.html" [4] => href=/advanced_search?hl=en> [5] => href=/preferences?hl=en> [6] => href=/language_tools?hl=en> [7] => href="/ads/"> [8] => href=/services/> [9] => href=/intl/en/about.html> [10] => href=http://www.google.co.uk/jobs/> [11] => href=http://www.google.com/ncr> ) [1] => Array ( [0] => href [1] => href [2] => href [3] => href [4] => href [5] => href [6] => href [7] => href [8] => href [9] => href [10] => href [11] => href ) [2] => Array ( [0] => " [1] => " [2] => " [3] => " [4] => [5] => [6] => [7] => " [8] => [9] => [10] => [11] => ) [3] => Array ( [0] => /imghp?hl=en&tab=wi&ie=UTF-8 [1] => /grphp?hl=en&tab=wg&ie=UTF-8 [2] => /nwshp?hl=en&tab=wn&ie=UTF-8 [3] => /options/index.html [4] => /advanced_search?hl=en [5] => /preferences?hl=en [6] => /language_tools?hl=en [7] => /ads/ [8] => /services/ [9] => /intl/en/about.html [10] => http://www.google.co.uk/jobs/ [11] => http://www.google.com/ncr ) )
With the following code:

Code: Select all

<?php
$filename = $_GET['url'];
$contents = implode('', file($filename));

// Retrieve all URLs from the HTML
$urls = array( 'href' );   //   resolve these attributes from the text    
$urls = implode( '|', $urls ); 
preg_match_all( '#\s+?(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s\>]#is', $contents, $matches );

print_r($matches);
?>
This looks correct! However, the result is a two-dimensional array, and some of its sub-arrays seem to store only 'href' and others only the quote character.

Is there any way to get a plain one-dimensional array with just the <contents> part of each HREF="<contents>" stored in each element?

I'd really appreciate your help on this, as I'll be able to continue with my project.

Cheers

Mark

Posted: Thu Jan 27, 2005 2:38 pm
by feyd
There were several problems, which I went through and fixed. I haven't tested it on much, though.

Code: Select all

<?php

// URL finding code
// get contents of a file into a string
function fileContents($file)
{
  $fp = fopen($file, 'rb');
  
  if(!$fp) return '';

  $contents = '';
  while(!feof($fp))
  {
    $contents .= fread($fp, 1024);
  }

  fclose($fp);

  return $contents;
}

//$filename = $_SERVER['argv'][1];
$filename = $_GET['url'];
$contents = fileContents( $filename );

var_export($contents);

// Retrieve all URLs from the HTML
$urls = array( 'href', 'src', 'action', 'background' );   //   resolve these attributes from the text   
$urls = implode( '|', $urls );
preg_match_all( '#(?<![a-z0-9])(' . $urls . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s>]#is', $contents, $matches );
print_r($matches);

?>
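
For anyone comparing the two versions, the read loop is one of the changes. The sketch below is only an illustration: the function names readAllBroken and readAllFixed and the temporary file are made up for the demo, and it reads a local file rather than a URL. It shows why the earlier condition, feof($fp) !== false, never enters the loop, while !feof($fp) reads to the end.

```php
<?php
// Sketch only: readAllBroken() and readAllFixed() are hypothetical names
// used for this illustration, not functions from the thread.

// Loop condition from the earlier post. feof() returns false right after a
// non-empty file is opened, so (false !== false) fails and nothing is read.
function readAllBroken($path)
{
    $fp = fopen($path, 'rb');
    if (!$fp) return '';

    $contents = '';
    while (feof($fp) !== false) {
        $contents .= fread($fp, 1024);
    }

    fclose($fp);
    return $contents;
}

// Corrected condition: keep reading 1024-byte chunks until end-of-file.
function readAllFixed($path)
{
    $fp = fopen($path, 'rb');
    if (!$fp) return '';

    $contents = '';
    while (!feof($fp)) {
        $contents .= fread($fp, 1024);
    }

    fclose($fp);
    return $contents;
}

// Demo on a throwaway local file (assumes the temp directory is writable).
$tmp = tempnam(sys_get_temp_dir(), 'fc');
$fh = fopen($tmp, 'wb');
fwrite($fh, '<a href="/ads/">Ads</a>');
fclose($fh);

var_dump(readAllBroken($tmp)); // empty string: the loop body never runs
var_dump(readAllFixed($tmp));  // the full file contents
unlink($tmp);
```

The same loop shape applies when fopen() is given a URL, provided the PHP configuration allows URL wrappers.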

Posted: Thu Jan 27, 2005 2:38 pm
by John Cartwright
you know there is an edit button...

no need for 4 posts in a row.

Posted: Thu Jan 27, 2005 2:41 pm
by feyd
getting a unidimensional version is just $matches[0]

Posted: Thu Jan 27, 2005 2:46 pm
by mjseaden
function 'var_export' is not recognised. Hmm.

Posted: Thu Jan 27, 2005 2:48 pm
by feyd
just comment that line out.. it was for debugging.

Posted: Thu Jan 27, 2005 2:53 pm
by mjseaden
Hi feyd

When I

Code: Select all

echo $matches[0];
I don't get any output apart from the word 'Array'.

Is there any way to get it in serial elements, with the URL in each element?

I hope that makes sense - then this whole issue is resolved!

Many thanks

Mark

Posted: Thu Jan 27, 2005 2:55 pm
by feyd
$matches[0] is an array... print_r($matches[0]);

silly monkey..

8)

Posted: Thu Jan 27, 2005 2:57 pm
by mjseaden
Got it! Needed print_r($matches[3])

Thanks a lot Feyd, I appreciate it!
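
To sum up how the result is laid out: with the default ordering, preg_match_all() fills $matches with one sub-array per capture group, so [0] holds the full matches, [1] the attribute names, [2] the quote character (if any), and [3] the bare URLs. Here is a small sketch of pulling out just the URLs. The sample HTML string is made up for the example, and the pattern is the one from feyd's post with the brackets restored.

```php
<?php
// Sample input is invented for this illustration.
$html = '<a href="/ads/">Ads</a> <a href=/services/>Services</a>';

// Attributes to look for, joined into a regex alternation.
$attrs = implode('|', array('href', 'src', 'action', 'background'));

// Group 1: attribute name, group 2: optional quote, group 3: the URL.
$pattern = '#(?<![a-z0-9])(' . $attrs . ')\s*?=\s*?([\'"]?)(.*?)\\2[\s>]#is';
preg_match_all($pattern, $html, $matches);

// $matches[3] is already a one-dimensional array of URLs.
$urls = $matches[3];

// echo on an array only prints the word "Array", so loop over the elements.
foreach ($urls as $url) {
    echo $url, "\n";
}
```

For this sample input, $urls holds '/ads/' and '/services/'.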