Reading contents of url - missing HTML comment tags.

nikosb
Forum Newbie
Posts: 8
Joined: Sat Sep 23, 2006 5:46 pm

Post by nikosb »

feyd | Please use the code and [syntax="..."] tags where appropriate when posting code. Your post has been edited to reflect how we'd like it posted. Please read: [url=http://forums.devnetwork.net/viewtopic.php?t=21171]Posting Code in the Forums[/url] to learn how to do it too.


Hello,

I am a beginner in PHP programming and I've been having a particular problem with reading the contents of a URL. Take, for example, the Google search results page for any random keyword.

[url]http://www.google.com/search?hl=en&ie=ISO-8859-1&q=phpdn&btnG=Google+Search[/url]

This is the link to the search results page for the keyword "phpdn". When you visit this URL with your browser and do a View Document Source, you will see that there are various HTML comment tags (e.g. <!--a-->, <!--m-->) in the HTML code. As a first step I try to read the entire page using a simple PHP program. I have tried three different ways to do that:

Code: Select all

<?php
$website = "http://www.google.com/search?hl=en&ie=ISO-8859-1&q=phpdn&btnG=Google+Search";

// The three methods below are alternatives; each one fills $page on its own.

// 1. Using HTTP_Request from PEAR:
require 'HTTP/Request.php';
$r = new HTTP_Request($website);
$r->sendRequest();
$page = $r->getResponseBody();

// 2. Using the cURL extension:
$c = curl_init($website);
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);

// 3. Using fopen and fread:
$page = '';
$fh = fopen($website, 'r') or die("Could not open $website");
while (! feof($fh)) {
  $page .= fread($fh, 1048576);
}
fclose($fh);

// At the end I print $page:
echo $page;
?>
When the program is executed it prints the contents of $page, and the page I get in my browser looks identical to the original page. However, when I do View Document Source on the PHP-generated page, I realise that it is missing the HTML comment tags of the original page. For example, the <!--a-->, <!--m--> or <!--n--> tags that are in the original page (the Google search results) are not in the contents of $page.

What is wrong here? Am I doing something incorrectly? I understand that cURL and HTTP_Request are predefined and that I have little control over how they read the contents of a URL, but I would expect fopen and fread to read exactly the contents of the URL I am fetching. Why can I not read the HTML comment tags? Any suggestions on how to do that?

Thank you,

Nikolaos


feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

It could be that the user agent PHP sends when retrieving the page triggers a response that doesn't include them. For fopen() and the other URL wrappers, the user_agent directive controls this, and you can change it via ini_set().
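As a rough sketch of that suggestion, something like the following sets the user agent for both the fopen-style wrappers and cURL. The User-Agent string and the function names (make_http_context, fetch_with_stream, fetch_with_curl) are made up for illustration, and whether the comments actually come back depends on how the remote server reacts to the header.

```php
<?php
// Hypothetical User-Agent value, for illustration only.
$userAgent = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)';

// Option 1: set the user_agent directive for the whole script.
// The http:// stream wrapper (fopen, file_get_contents) reads it.
ini_set('user_agent', $userAgent);

// Option 2: build a per-request stream context instead of a global setting.
function make_http_context($userAgent) {
    return stream_context_create(array(
        'http' => array('user_agent' => $userAgent),
    ));
}

// Fetch a URL through the stream wrapper with the given user agent.
function fetch_with_stream($url, $userAgent) {
    return file_get_contents($url, false, make_http_context($userAgent));
}

// Fetch a URL through cURL, which takes the user agent as its own option.
function fetch_with_curl($url, $userAgent) {
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_USERAGENT, $userAgent);
    $page = curl_exec($c);
    curl_close($c);
    return $page;
}
```

Calling fetch_with_stream($website, $userAgent) or fetch_with_curl($website, $userAgent) and comparing the result against the browser's View Document Source should show whether the header makes a difference.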

I hope you have permission to pull data from whatever site you are actually trying, as pulling data without prior written permission is against many sites' usage policies.