Whence the page size variation?

RobBroekhuis
Forum Newbie
Posts: 5
Joined: Tue Nov 02, 2004 6:32 am
Location: Allentown, PA, USA

Whence the page size variation?

Post by RobBroekhuis »

I've noticed that page sizes for .php pages, as reported by the number of bytes in my server logs, always move up and down a little bit. But the issue got a little more of my attention when I noticed this morning that one of my main database-generated pages varies by as much as 6000 bytes (between 35000 and 41000). The database isn't changing, and the page as presented to the user isn't changing. There is no querystring, nothing fancy. So where do the size variations come from?
Please set me straight - thanks!
Rob
kettle_drum
DevNet Resident
Posts: 1150
Joined: Sun Jul 20, 2003 9:25 pm
Location: West Yorkshire, England

Post by kettle_drum »

Does the page show the visitor's username? Is there a time and/or date shown somewhere on the page? Try requesting the page yourself and seeing what size you get.
rehfeld
Forum Regular
Posts: 741
Joined: Mon Oct 18, 2004 8:14 pm

Post by rehfeld »

maybe it's throwing error messages? check which user agent corresponds to the oddly sized pages, and maybe try viewing the site w/ that browser.


you could also temporarily use output buffering to capture the output sent to the user and log it, so you can look at it

<?php
ob_start(); // put this at the very beginning of the script, make sure it's before any includes

// entire script goes here

$output = ob_get_contents(); // put this at the very end of the script, after all includes

// write the $output to a file
$filename = time() . '.html';
$fp = fopen($filename, 'w');
fwrite($fp, $output);
fclose($fp);

echo $output; // still gotta show it to the user or they won't like your website much

?>
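a lighter variant of the same idea, as a rough sketch: instead of saving every page, append one line per request with the byte count of the buffered body and the user agent, so the numbers can be lined up against the server log. the helper name log_page_size and the /tmp path are made up for illustration, and ob_end_flush() is used here to send the page rather than an echo.

```php
<?php
// Hypothetical helper (not part of the original script): append one line per
// request with the size in bytes of the page body and the user agent.
function log_page_size($output, $user_agent, $logfile)
{
    $size = strlen($output); // strlen() counts bytes
    $line = date('Y-m-d H:i:s') . "\t" . $size . "\t" . $user_agent . "\n";
    $fp = @fopen($logfile, 'a');
    if (!$fp) {
        return false; // log file not writable
    }
    fwrite($fp, $line);
    fclose($fp);
    return $size;
}

ob_start(); // at the very beginning of the script, before any includes

// entire script goes here

$output = ob_get_contents(); // at the very end of the script
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-';
log_page_size($output, $ua, '/tmp/page_sizes.log'); // assumed path, must be writable

ob_end_flush(); // send the buffered page to the user
?>
```

if the sizes in this log stay constant while the server log keeps varying, the extra bytes are being added outside the captured page body.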
RobBroekhuis

partial answers...

Post by RobBroekhuis »

To partly answer your followup questions...
No - the html content of the page is not changing (no username, date, or other variations). When I load the page twice into IE, "view source", and save as text file, the two text files have identical size. The two log entries show a different number of bytes.
So I get the size variations for my own requests - but the most extreme variations I've seen come when Googlebot requests the same page twice. Does that help in understanding the issue?
Will the output buffering approach capture more than just the page html? I think I'll give it a try, anyway - thanks for the suggestion.
rehfeld

Post by rehfeld »

no, it will only capture the html

if using php5, i know how to get the headers, but not on php4.....

what you could do is use a local script to request the page in question from your site, and spoof the different user agents that google is using (since you know it's causing different page sizes)

if you do that, getting the headers is a snap (and works in php4)

<?php

$ua = 'googles uagent from your log file'; // make sure you run the script for each of the different user agents google uses
// google sometimes requests the same page more than once and pretends to be a different browser, to see if you're sniffing for google and trying to feed it keywords
ini_set('user_agent', $ua);

$url = 'http://example.org'; // your page in question

$fp = fopen($url, 'r');

if ($fp) {
    // for http urls, the response headers end up in $meta_data['wrapper_data']
    $meta_data = stream_get_meta_data($fp);
    fclose($fp);

    echo '<pre>';
    print_r($meta_data);
    echo '</pre>';
}

?>
oh and small size variations could easily be due to setting cookies
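whether headers like Set-Cookie count toward the logged size at all depends on what the log records; Apache's %b field, for instance, counts the response body only and excludes headers. as a rough sketch for judging how much a cookie could add on the wire, the hypothetical helper below (header_block_bytes is a made-up name, and the sample headers are invented) adds up the size of a header block given as an array of lines, like the one in $http_response_header.

```php
<?php
// Hypothetical helper: approximate size in bytes of a response header block,
// given header lines without line endings (as in $http_response_header).
function header_block_bytes($lines)
{
    $total = 0;
    foreach ($lines as $line) {
        $total += strlen($line) + 2; // each line ends with CRLF on the wire
    }
    return $total + 2; // plus the empty line (CRLF) that ends the headers
}

// made-up sample headers, for illustration only
$plain = array(
    'HTTP/1.1 200 OK',
    'Content-Type: text/html',
);
$with_cookie = $plain;
$with_cookie[] = 'Set-Cookie: PHPSESSID=0123456789abcdef0123456789abcdef; path=/';

$extra = header_block_bytes($with_cookie) - header_block_bytes($plain);
echo $extra . " extra header bytes for the session cookie\n";
?>
```

a session cookie like this adds well under a hundred bytes, so cookies alone would not account for a swing of several thousand.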
Last edited by rehfeld on Tue Nov 02, 2004 8:01 pm, edited 2 times in total.
RobBroekhuis

Post by RobBroekhuis »

Rehfeld,
I implemented the output buffering (by the way, the echo statement at the end is not necessary - PHP flushes its output buffer on its own when the script finishes). The results are very curious: with the output buffering in place, I no longer get varying load sizes - they are all exactly the same, at the low end of the previously noted range. The sizes of the files written are also all the same (a few bytes smaller than the load size), and the "view source" saved version is somewhat larger (I suspect because LFs are converted to CRLFs). So the buffering suppresses the "extra bytes" sent. I wonder if, during processing, PHP or Apache sends "keep-alive" protocol bytes to let the client's browser know it's still working? Just a random guess...
rehfeld

Post by rehfeld »

that's a good theory. def could be happening.

but a few thousand bytes seems more than what a header or 2 would cause.

i'm thinking php is throwing errors.

if it was throwing 'headers already sent' errors, like when you try to start a session after sending some html, output buffering would eliminate those errors, and it would mask the problem.
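one way to check the error theory without output buffering getting in the way, as a minimal sketch: turn off error display and send errors to a log file instead, then see whether anything shows up there for the oddly sized requests. the log path is an assumption (any writable file will do), and the trigger_error() line is only there to prove the logging works.

```php
<?php
// Sketch: keep php errors out of the page and write them to a log file instead.
$error_log = '/tmp/php_page_errors.log'; // assumed path, must be writable

error_reporting(E_ALL);           // report everything, including notices
ini_set('display_errors', '0');   // don't print errors into the html
ini_set('log_errors', '1');       // write them to the log instead
ini_set('error_log', $error_log);

// rest of the script goes here

// deliberate test warning, remove once the log is confirmed to work
trigger_error('test warning, safe to remove', E_USER_WARNING);
?>
```

put this at the top of the page script; if the page sizes stop varying and the log fills up, the extra bytes were error messages.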
rehfeld

Post by rehfeld »

i've been wanting to make something like this anyway, just hadn't got around to it.

this should help you out.

<?php

// use the submitted user agent, or fall back to the one your browser sent
if (isset($_POST['user_agent'])) {
    $user_agent = $_POST['user_agent'];
} elseif (isset($_SERVER['HTTP_USER_AGENT'])) {
    $user_agent = $_SERVER['HTTP_USER_AGENT'];
} else {
    $user_agent = '';
}

ini_set('user_agent', $user_agent);

$url = isset($_POST['url']) ? $_POST['url'] : '';

$html = '';
// only fetch http urls, so the form can't be used to open local files
if (preg_match('#^https?://#i', $url)) {

    $fp = @fopen($url, 'r');

    if ($fp) {

        // read the whole response body
        while (!feof($fp)) {
            $html .= fread($fp, 1024);
        }

        fclose($fp);

    }

}

?>

<form method="post" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">

<p>Url, must be of the form http://example.org<br>
    <input type="text" size="100" name="url" value="<?php echo htmlspecialchars($url); ?>"></p>

<p>The user agent you want to pretend to be (defaults to the user agent of your browser)<br>
    <input type="text" size="100" name="user_agent" value="<?php echo htmlspecialchars($user_agent); ?>"></p>

<p><input type="submit"></p>
</form>

<hr>

<pre>

<?php if (isset($http_response_header)) echo htmlspecialchars(print_r($http_response_header, true)); ?>

<hr>

<?php echo htmlspecialchars($html); ?>

</pre>
RobBroekhuis

Post by RobBroekhuis »

Rehfeld,
That's a useful little script - I put it in place as a testing playground. The html bit comes through just as it would in a browser (i.e., what I see with "view source"). The header bit is given as:
Array
(
[0] => HTTP/1.1 200 OK
[1] => Date: Wed, 03 Nov 2004 12:47:10 GMT
[2] => Server: Apache/1.3.29 (Unix)
[3] => X-Powered-By: PHP/4.3.8
[4] => Connection: close
[5] => Content-Type: text/html
)

Nothing unexpected, I believe. I didn't try to pose as a different useragent, but I don't think my server tries to cloak anything based on UA or IP - and my script certainly doesn't. I got a suggestion elsewhere to try a "packet sniffer". Never used one of those, but I may look into it.
Thanks for helping me think through this!
Rob