
extracting the source code from a website

Posted: Sat Apr 28, 2007 9:20 pm
by big_mumma
feyd | Please use code tags and [syntax="..."] tags where appropriate when posting code. Your post has been edited to reflect how we'd like it posted. Please read: [url=http://forums.devnetwork.net/viewtopic.php?t=21171]Posting Code in the Forums[/url] to learn how to do it too.


Hi,

I'm trying to extract all links from an external website.

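Code:

<?php
// Read the target URL from the submitted form (POST) or the query string (GET).
if (getenv('REQUEST_METHOD') == 'POST') {
  $url = isset($_POST['url']) ? $_POST['url'] : '';
} else {
  $url = isset($_GET['url']) ? $_GET['url'] : '';
}
?>

<form action="<?= htmlspecialchars($_SERVER['PHP_SELF']) ?>" method="POST">
URL:<input type="text" name="url" value="<?= htmlspecialchars($url) ?>"/><br>
<input type="submit">
</form>

<?php
  if ($url) {
    // Fetch the remote page with a single fread() call of up to 1 MB.
    $remote = fopen($url, 'r');
    $html = fread($remote, 1048576);
    fclose($remote);

    // Match anchor elements and print each one.
    $urlpattern = '/<a .+<\/a>/i';
    preg_match_all($urlpattern, $html, $matches);
    printf("Output of URLs %d URLs<P>", sizeof($matches[0]));
    foreach ($matches[0] as $u) {
      $u = trim($u);
      echo $u."<br>\n";
    }
  }
?>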
The code works fine, but I notice that it doesn't extract ALL links from a website, as it is only returning part of the source code.

Using the code

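Code:

<?php
// Read the target URL from the submitted form (POST) or the query string (GET).
if (getenv('REQUEST_METHOD') == 'POST') {
  $url = isset($_POST['url']) ? $_POST['url'] : '';
} else {
  $url = isset($_GET['url']) ? $_GET['url'] : '';
}
?>

<form action="<?= htmlspecialchars($_SERVER['PHP_SELF']) ?>" method="POST">
URL:<input type="text" name="url" value="<?= htmlspecialchars($url) ?>"/><br>
<input type="submit">
</form>

<?php
  if ($url) {
    // Fetch the remote page with a single fread() call of up to 1 MB.
    $remote = fopen($url, 'r');
    $html = fread($remote, 1048576);
    fclose($remote);

    // Dump whatever was read so the truncation is visible.
    echo $html;
  }
?>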
shows that only part of the code is being returned.

e.g. http://www.weberdev.com/ has lots of links, but only the header is returned using the code given above.

Is there a way I can get the source code of an external website as if it was being opened in a browser?



Posted: Sun Apr 29, 2007 3:37 am
by onion2k
By the looks of it, a lot of the content on that site is generated using client-side JavaScript. It'll be extremely difficult to get it using PHP.
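
Separately from any client-side JavaScript, a likely reason the script in the first post returns only part of the page is the single fread() call. On a network stream, fread() returns as soon as a chunk of data is available (typically up to 8192 bytes) rather than waiting for the full length requested, so it has to be called in a loop until feof() reports the end of the stream. A minimal sketch, assuming a hypothetical helper named read_all() and using an in-memory stream as a stand-in for the remote handle so it runs without network access:

```php
<?php
// Hypothetical helper (not from the original posts): keep reading until the
// stream is exhausted instead of relying on a single fread() call.
function read_all($stream, int $chunkSize = 8192): string
{
    $data = '';
    while (!feof($stream)) {
        $chunk = fread($stream, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break; // read error, timeout, or nothing left: stop with what we have
        }
        $data .= $chunk;
    }
    return $data;
}

// Stand-in for fopen($url, 'r'): an in-memory stream holding 20000 bytes.
$stream = fopen('php://memory', 'w+');
fwrite($stream, str_repeat('x', 20000));
rewind($stream);

$html = read_all($stream);
fclose($stream);

echo strlen($html), "\n";
```

In the original script this would mean replacing the single fread() line with $html = read_all($remote);. The built-in stream_get_contents($remote) or file_get_contents($url) should have the same effect with less code.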

try curling

Posted: Sun Apr 29, 2007 4:09 am
by afbase
The most Canadian sport could help ya!

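Code:

<?php
// Fetch a page with cURL and print the response body.
function curling() {
	$url = "http://www.weberdev.com";
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_FAILONERROR, 1);    // treat HTTP status >= 400 as a failure
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
	curl_setopt($ch, CURLOPT_TIMEOUT, 3);        // give up after 3 seconds
	$result = curl_exec($ch);
	if ($result === false) {
		// curl_exec() returns false on failure, e.g. a timeout or an HTTP error
		$result = 'cURL error: ' . curl_error($ch);
	}
	curl_close($ch);
	print $result;
}
curling();
?>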
Not sure if cURL catches everything you want, but it does get the links.
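
Once the markup has been fetched (with fopen() or cURL), a DOM parser is usually a more reliable way to pull out the links than the '/<a .+<\/a>/i' pattern from the first post. That pattern is greedy, so two links on one line come back as a single match, and because "." does not match newlines by default, an anchor that spans several lines is missed. A minimal sketch, assuming a hypothetical helper named extract_links() and a made-up sample string in place of the downloaded page; links that a page adds through JavaScript will still not show up:

```php
<?php
// Hypothetical helper (not from the original posts): collect the href of
// every <a> element found in an HTML string.
function extract_links(string $html): array
{
    if ($html === '') {
        return []; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href; // skip anchors without an href, e.g. <a name="top">
        }
    }
    return $links;
}

// Made-up sample markup standing in for the fetched page.
$sample = '<p><a href="/one">One</a> <a href="http://example.com/two">Two</a></p>'
        . "\n" . '<a name="top">no href</a>';

print_r(extract_links($sample));
```

With the cURL function above, this would mean returning $result instead of printing it and passing it to extract_links().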