file_get_contents() gives slightly altered file contents



tokyotech
Forum Newbie
Posts: 1
Joined: Fri Dec 26, 2008 12:04 am


Post by tokyotech »

I have a simple echo of a website's source:


 
echo file_get_contents('http://www.threadless.com/blogs/blogs');    
 
A small portion of the output is a little fishy:


 
<a class="pagea selected" href="/blogs/blogs?token=ccaea4f99cbadd8262c148c86e1d8b06&uuid=5abf8a35510975e77f4618b544f7fe65/page,1">1</a>
 
When viewing the page source of the actual page in a browser, that same area is:


 
<a class="pagea selected" href="/blogs/blogs/page,1">1</a>
 

file_get_contents() actually returns slightly different contents than the actual page! The function seems to have added "?token=randomString" to all of the page traversal URLs. I'm working on a web crawler, and these weird links are screwing up the crawling.
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: file_get_contents() gives slightly altered file contents

Post by requinix »

What you're seeing is a result of session tracking. When file_get_contents() retrieves the page, the request carries no cookies, so the server creates a new session and passes the session data along through the links instead. PHP does the same thing when session.use_trans_sid is enabled.

Clear your browser's cache (specifically the cookies), restart it, and visit the page. I bet you'll see something slightly different then.

To fix, strip out the token= and uuid= parts (the pattern should be the same everywhere). You could use cURL to get the page and send cookies at the same time, but you'd probably have to send valid data, or else the server will assume corruption and tack on those two parts anyway.


// Match the leading "?" as well, otherwise the links end up as /blogs/blogs?/page,1
$html = preg_replace('/\?token=[a-z0-9]{32}&uuid=[a-z0-9]{32}/', "", $html);
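
As a quick check of the stripping approach, here is a minimal sketch run against the exact link from the first post. The helper name strip_session_params() is made up for illustration, and the pattern assumes both values are always 32 characters of lowercase letters and digits, as in that sample. It also accepts "&amp;" between the two parameters in case the markup is entity-encoded, which is a guess rather than something seen in the output above.

```php
<?php
// Hypothetical helper (not from the thread): remove the session query string
// that the server appends when the request carries no cookies.
function strip_session_params($html)
{
    // "\?" consumes the question mark too, so no stray "?" is left in the URL.
    // "(?:amp;)?" tolerates an entity-encoded ampersand (an assumption).
    return preg_replace('/\?token=[a-z0-9]{32}&(?:amp;)?uuid=[a-z0-9]{32}/', '', $html);
}

// The link exactly as it appeared in the file_get_contents() output.
$link = '<a class="pagea selected" href="/blogs/blogs?token=ccaea4f99cbadd8262c148c86e1d8b06&uuid=5abf8a35510975e77f4618b544f7fe65/page,1">1</a>';

echo strip_session_params($link), "\n";
// <a class="pagea selected" href="/blogs/blogs/page,1">1</a>
```

The result matches what the browser showed, so the crawler can run every fetched page through the same call before extracting links.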
Now, what's really fishy is why you're copying this site, and why the added token and uuid are a problem for you.
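
For the cURL route mentioned above, a rough sketch could look like the following. The function name fetch_with_cookies() and the cookie file path are made up for illustration. It assumes the site sets its session cookie on the first response and stops rewriting links once that cookie comes back, which has not been verified against this particular server.

```php
<?php
// Hypothetical helper (not from the thread): fetch a URL with cURL while
// keeping cookies in a jar file, so later requests look like a returning visitor.
function fetch_with_cookies($url, $cookieFile)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow redirects
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);  // send cookies already stored in the jar
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);   // save received cookies when the handle is released
    $html = curl_exec($ch);                             // false on failure
    curl_close($ch);
    return $html;
}

// Usage (performs real HTTP requests, so it is left commented out):
// $jar = tempnam(sys_get_temp_dir(), 'cookies');
// fetch_with_cookies('http://www.threadless.com/blogs/blogs', $jar);         // first hit: server hands out cookies
// $html = fetch_with_cookies('http://www.threadless.com/blogs/blogs', $jar); // second hit: cookies are sent back
```

If the links still come back with token= and uuid= on the second request, the server wants more than its own cookie echoed back, and stripping the parameters is the simpler option.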