Whence the page size variation?

Posted: Tue Nov 02, 2004 6:39 am
by RobBroekhuis
I've noticed that page sizes for .php pages, as reported by the number of bytes in my server logs, always move up and down a little bit. But the issue got a little more of my attention when I noticed this morning that one of my main database-generated pages varies by as much as 6000 bytes (between 35000 and 41000). The database isn't changing, and the page as presented to the user isn't changing. There is no querystring, nothing fancy. So where do the size variations come from?
Please set me straight - thanks!
Rob

Posted: Tue Nov 02, 2004 7:02 am
by kettle_drum
Does the page show their username? Is there a time and/or date shown somewhere on the page? Try running the page yourself and seeing what size you get.

Posted: Tue Nov 02, 2004 11:06 am
by rehfeld
maybe it's throwing error messages? check which user agent corresponds to the oddly sized pages, and maybe try viewing the site with that browser.


you could also temporarily use output buffering to capture the output sent to the user and log it, so you can look at it

Code:

<?php
ob_start(); // put this at the very beginning of the script, before any includes

// entire script goes here

$output = ob_get_contents(); // put this at the very end of the script, after all includes

// write $output to a file named after the current timestamp
$filename = time() . '.html';
$fp = fopen($filename, 'w');
if ($fp) {
    fwrite($fp, $output);
    fclose($fp);
}

echo $output; // still gotta show it to the user or they won't like your website much

?>
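
A rough sketch of the same capture idea that also records the body size, so the number can be set next to the byte count in the server log. It assumes PHP 5 or later; `capture_page` and `log_body_size` are made-up names for illustration, and it uses ob_get_clean(), which returns the buffer and discards it so the page is only sent once.

```php
<?php
// Run $render (any callable that echoes the page) and return the captured body.
function capture_page($render) {
    ob_start();
    call_user_func($render);
    return ob_get_clean(); // returns the buffer contents and turns buffering off
}

// Hypothetical helper: append timestamp, body size and checksum to a log file.
// Returns the body size in bytes.
function log_body_size($body, $log_file) {
    $line = sprintf("%s %d bytes md5=%s\n", date('c'), strlen($body), md5($body));
    file_put_contents($log_file, $line, FILE_APPEND);
    return strlen($body);
}

// Usage (render_page and the log path are placeholders):
// $body = capture_page('render_page');
// log_body_size($body, '/tmp/body_sizes.log');
// echo $body;
```

If the logged size and checksum never change while the server log still shows different byte counts, the variation is coming from somewhere other than the script's own output.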

partial answers...

Posted: Tue Nov 02, 2004 12:21 pm
by RobBroekhuis
To partly answer your followup questions...
No - the html content of the page is not changing (no username, date, or other variations). When I load the page twice into IE, "view source", and save as text file, the two text files have identical size. The two log entries show a different number of bytes.
So I get the size variations for my own requests - but the most extreme variations I've seen come when Googlebot requests the same page twice. Does that help in understanding the issue?
Will the output buffering approach capture more than just the page html? I think I'll give it a try, anyway - thanks for the suggestion.

Posted: Tue Nov 02, 2004 12:27 pm
by rehfeld
no, it will only capture the html (the response body), not the http headers

if using php5, i know how to get the headers, but not on php4.....

what you could do is use a local script to request the page in question from your site, and spoof the different user agents that google is using (since you know it's causing different page sizes)

if you do that, getting the headers is a snap (and works in php4)

Code:

<?php

// make sure you run the script once for each of the different user agents google uses
$ua = 'googles user agent from your log file';
// google sometimes requests the same page more than once, pretending to be a different
// browser, to see if you're sniffing for google and trying to feed it keywords
ini_set('user_agent', $ua);

$url = 'http://example.org'; // your page in question

$fp = fopen($url, 'r');
if ($fp) {
    // for http urls, the response headers end up in $meta_data['wrapper_data']
    $meta_data = stream_get_meta_data($fp);
    fclose($fp);

    echo '<pre>';
    print_r($meta_data);
    echo '</pre>';
}

?>

oh, and small size variations could easily be due to setting cookies
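
To compare several user agents in one run, something along these lines might work. It is only a sketch, assuming PHP 5 or later with URL fopen wrappers enabled; `ua_context` and `body_sizes` are hypothetical helper names, and it sets the user agent per request through a stream context instead of ini_set().

```php
<?php
// Build a stream context that sends the given User-Agent header.
function ua_context($user_agent) {
    return stream_context_create(array(
        'http' => array('user_agent' => $user_agent),
    ));
}

// Fetch $url once per user agent and return the body size in bytes for each.
// An entry of false means that request failed.
function body_sizes($url, $user_agents) {
    $sizes = array();
    foreach ($user_agents as $ua) {
        $body = @file_get_contents($url, false, ua_context($ua));
        $sizes[$ua] = ($body === false) ? false : strlen($body);
    }
    return $sizes;
}

// Usage (placeholder URL and user-agent strings; copy the real ones from your log):
// print_r(body_sizes('http://example.org/', array('Googlebot/2.1', 'Mozilla/4.0')));
```

These sizes cover the body only, so they will not line up with the log if the log figure includes headers.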

Posted: Tue Nov 02, 2004 12:41 pm
by RobBroekhuis
Rehfeld,
I implemented the output buffering (by the way, the echo statement at the end is not necessary: ob_get_contents() leaves the buffer in place and PHP flushes it when the script ends, so the echo would send the page twice). With very curious results: with the output buffering in place, I no longer get varying load sizes. They are all exactly the same, at the low end of the previously noted range. The sizes of the files written are also all the same (a few bytes smaller than the load size), and the "view source" saved version is somewhat larger (I suspect because LFs are converted to CRLFs). So the buffering suppresses the "extra bytes" sent. I wonder if, during processing, PHP or Apache sends "keep-alive" protocol bytes to let the client's browser know it's still working? Just a random guess...

Posted: Tue Nov 02, 2004 12:44 pm
by rehfeld
that's a good theory. definitely could be happening.

but a few thousand bytes seems more than what a header or 2 would cause.

i'm thinking php is throwing errors.

if it was throwing 'headers already sent' errors, like when you try to start a session after sending some html, output buffering would eliminate those errors, and it would mask the problem.
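
One way to test the error theory without changing what visitors see is to send warnings and notices to a file. The following is a minimal sketch, assuming PHP 5.2 or later; `page_error_log_path` and `log_page_error` are invented names, and the log location is a placeholder. Fatal errors never reach a handler like this, so it only covers warnings, notices and similar.

```php
<?php
// Hypothetical log location; point this at any writable path on your server.
function page_error_log_path() {
    return sys_get_temp_dir() . '/page_errors.log';
}

// Append each warning or notice to the log, together with the requesting user agent.
function log_page_error($errno, $errstr, $errfile, $errline) {
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-';
    $entry = sprintf("[%s] (%d) %s in %s:%d UA=%s\n",
        date('Y-m-d H:i:s'), $errno, $errstr, $errfile, $errline, $ua);
    file_put_contents(page_error_log_path(), $entry, FILE_APPEND);
    return true; // tells PHP the error was handled, so it is not printed into the page
}

// Install the handler at the very top of the script, before any includes.
set_error_handler('log_page_error');
```

If the log fills up only for certain user agents, that would point to errors being printed into some responses and not others.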

Posted: Tue Nov 02, 2004 1:12 pm
by rehfeld
i've been wanting to make something like this anyway, just hadn't got around to it.

this should help you out.

Code:

<?php
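
// use the user agent from the form if given, otherwise the browser's own
if (isset($_POST['user_agent'])) {
    $user_agent = $_POST['user_agent'];
} else {
    $user_agent = $_SERVER['HTTP_USER_AGENT'];
}

// the http:// wrapper sends this value as the User-Agent header
ini_set('user_agent', $user_agent);

$html = '';
// only fetch http(s) urls, so the form can't be used to read local files
if (!empty($_POST['url']) && preg_match('#^https?://#i', $_POST['url'])) {

    $fp = @fopen($_POST['url'], 'r');

    if ($fp) {
        // read the response body in 1 KB chunks
        while (!feof($fp)) {
            $html .= fread($fp, 1024);
        }

        fclose($fp);
    }

}

?>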

<form method="post" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">

<p>URL, must be of the form http://example.org<br>
    <input type="text" size="100" name="url" value="<?php echo isset($_POST['url']) ? htmlspecialchars($_POST['url']) : ''; ?>"></p>

<p>The user agent you want to pretend to be (defaults to the user agent of your browser)<br>
    <input type="text" size="100" name="user_agent" value="<?php echo htmlspecialchars($user_agent); ?>"></p>

<p><input type="submit"></p>
</form>

<hr>

<pre>

<?php if (isset($http_response_header)) echo htmlspecialchars(print_r($http_response_header, true)); ?>

<hr>

<?php echo htmlentities($html); ?>

</pre>

Posted: Wed Nov 03, 2004 6:57 am
by RobBroekhuis
Rehfeld,
That's a useful little script - I put it in place as a testing playground. The html bit comes through just as it would in a browser (i.e., what I see with "view source"). The header bit is given as:
Array
(
[0] => HTTP/1.1 200 OK
[1] => Date: Wed, 03 Nov 2004 12:47:10 GMT
[2] => Server: Apache/1.3.29 (Unix)
[3] => X-Powered-By: PHP/4.3.8
[4] => Connection: close
[5] => Content-Type: text/html
)

Nothing unexpected, I believe. I didn't try to pose as a different useragent, but I don't think my server tries to cloak anything based on UA or IP - and my script certainly doesn't. I got a suggestion elsewhere to try a "packet sniffer". Never used one of those, but I may look into it.
Thanks for helping me think through this!
Rob
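
As a quick sanity check on how much the headers could contribute, the header lines can be totalled up. This is a small sketch only; `header_block_bytes` is a made-up name, and it assumes each header line is sent with a trailing CRLF plus one blank CRLF line at the end of the block.

```php
<?php
// Total the on-the-wire size of a list of header lines, such as the array
// PHP puts in $http_response_header after an http:// fopen().
function header_block_bytes($header_lines) {
    $bytes = 2; // the blank CRLF line that ends the header block
    foreach ($header_lines as $line) {
        $bytes += strlen($line) + 2; // the line itself plus its CRLF
    }
    return $bytes;
}

// Usage, right after fetching a page over http:
// echo header_block_bytes($http_response_header);
```

For the six header lines listed above this comes to under 200 bytes, which fits the earlier point that a header or two cannot account for a swing of several thousand bytes.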