Reading a webpage (using cURL) => How?

Posted: Wed Jul 14, 2004 4:06 am
by visionmaster
Hello,

I would like to grab text of a given URL and place it in a string.

I started using cURL and would like to "leech" a website. Since cURL lets me set a timeout and delivers detailed error messages and error numbers, it seems to be a good choice.

The following PHP script reads a website. It works fine, with one problem: images and links that use relative URLs are broken in the output. E.g. instead of http://www.url.de/folder/test.gif my browser requests http://192.xxx.x.xxx/folder/test.gif (the IP of my internal webserver).

=> O.k., this of course only concerns relative links. So my question:
How can I use cURL and make the server I am requesting think I am a regular browser client rather than another server? I set CURLOPT_USERAGENT to "Mozilla/4.0"; unfortunately this does not help.

PHP source code:

Code: Select all

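<?php
// Fetch the page into a string and print it.
$string = download("http://www.url.de");

if ($string === false) {
   echo "Download failed";
} else {
   echo $string;
}

// Returns the response body as a string, or false on failure.
function download($url) {
   $ch = curl_init($url);                           // curl_init($url) already sets CURLOPT_URL
   curl_setopt($ch, CURLOPT_HEADER, false);         // keep response headers out of the result
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
   curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0");
   $result = curl_exec($ch);
   curl_close($ch);
   return $result;
}
?>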
I would appreciate any help!

Thanks,
visionmaster


feyd | Please use [code] tags when posting code. Read: Posting Code in the Forums (http://forums.devnetwork.net/viewtopic.php?t=21171)

Posted: Wed Jul 14, 2004 4:11 am
by feyd
it sounds like it's functioning exactly as it's supposed to. You'll need to use regex functions to twist the image urls into working ones, or inject a <base> tag into the html output.
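
A rough sketch of the regex route, for illustration only. The helper name absolutize_links() is made up here, and it assumes the base is a site root such as http://www.url.de, so every relative path is resolved against that root and "../" segments are not handled:

```php
<?php
// Hypothetical helper: rewrites relative src/href values to absolute URLs.
// Simplification: $base is assumed to be a site root, so all relative paths
// are resolved against it (no "../" or sub-directory handling).
function absolutize_links($html, $base) {
    $base = rtrim($base, '/');
    return preg_replace_callback(
        '#\b(src|href)\s*=\s*(["\'])(.*?)\2#i',
        function ($m) use ($base) {
            $link = $m[3];
            // Leave empty values, fragments, absolute URLs (scheme:) and
            // protocol-relative URLs (//host/...) untouched.
            if ($link === '' || $link[0] === '#'
                || preg_match('#^([a-z][a-z0-9+.-]*:|//)#i', $link)) {
                return $m[0];
            }
            // Rebuild the attribute with the base prepended.
            return $m[1] . '=' . $m[2] . $base . '/' . ltrim($link, '/') . $m[2];
        },
        $html
    );
}

echo absolutize_links('<img src="/folder/test.gif">', 'http://www.url.de');
// prints: <img src="http://www.url.de/folder/test.gif">
```

This only touches quoted src and href attributes; URLs inside CSS or JavaScript would still point at the wrong host, which is one reason the <base> tag is often the simpler option.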

Posted: Wed Jul 14, 2004 6:20 am
by visionmaster
feyd wrote:it sounds like it's functioning exactly as it's supposed to. You'll need to use regex functions to twist the image urls into working ones, or inject a <base> tag into the html output.
O.k., then it is easier to use file_get_contents().
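
For comparison, a minimal sketch of the file_get_contents() variant. The function name fetch_page() and the 10-second timeout are assumptions for illustration, and fetching over HTTP this way requires allow_url_fopen to be enabled:

```php
<?php
// Hypothetical helper: fetches a URL (or a local path) into a string.
// Returns null on failure. The 10-second default is an arbitrary assumption.
function fetch_page($url, $timeout = 10) {
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => $timeout,       // read timeout in seconds
            'user_agent' => 'Mozilla/4.0',  // same agent string as the cURL version
        ),
    ));
    // @ suppresses the warning; failure is signalled by the return value instead.
    $result = @file_get_contents($url, false, $context);
    return $result === false ? null : $result;
}

// Usage (not executed here, since it needs network access):
// $string = fetch_page('http://www.url.de');
// if ($string === null) { echo "Download failed"; }
```

The relative links behave the same as with cURL, since the returned HTML is identical either way.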

Could you explain your comment about injecting a <base> tag into the HTML output? How exactly do I do that?

Regards from Germany

Posted: Wed Jul 14, 2004 7:02 am
by feyd
(untested)

Code: Select all

$html_to_output = preg_replace('#(<\s*/\s*head[^>]*>)#i','<base href="'.$url_you_pulled_from.'" />\\1',$html_you_pulled,1);
That should add <base href="http://www.url.de" /> immediately before </head>, so the browser resolves relative links against the original site instead of your own server.
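
A self-contained variant of the same idea, as a sketch rather than a drop-in fix. The name inject_base() is hypothetical. It differs from the one-liner in two ways that are assumptions on my part: it places the tag right after the opening <head> so it comes before any <link> or <script> tags, and it uses a callback so that a "$" or backslash in the URL cannot be read as a backreference:

```php
<?php
// Hypothetical helper: inserts <base href="..."> right after the opening <head> tag.
// If the document has no <head> tag, the HTML is returned unchanged.
function inject_base($html, $base_url) {
    // Escape the URL so quotes or ampersands cannot break the attribute.
    $tag = '<base href="' . htmlspecialchars($base_url, ENT_QUOTES) . '" />';
    return preg_replace_callback(
        '#<\s*head(\s[^>]*)?>#i',   // matches <head> or <head ...>, but not <header>
        function ($m) use ($tag) {
            return $m[0] . $tag;
        },
        $html,
        1                           // only the first match
    );
}

echo inject_base('<html><head><title>t</title></head></html>', 'http://www.url.de/');
// prints: <html><head><base href="http://www.url.de/" /><title>t</title></head></html>
```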