file_get_contents gets funny characters

alfmarius
Forum Newbie
Posts: 16
Joined: Thu Jul 10, 2008 5:14 am

file_get_contents gets funny characters

Post by alfmarius »

I'm trying to dump a webpage which contains this text:

If you’re one of those

But using file_get_contents gives me this text:

If you’re one of those

The documentation says: As of PHP 6, the default encoding of the read data is UTF-8. You can specify a different encoding by creating a custom context or by changing the default using stream_default_encoding(). This flag cannot be used with FILE_BINARY.

I'm thinking the problem is that PHP reads this as latin1 instead of UTF-8, or something in that direction. The suggested function stream_default_encoding() doesn't even exist! Very strange.

Does anyone have a clue?

thanks :)
fannnn
Forum Newbie
Posts: 6
Joined: Tue Jun 16, 2009 10:55 am

Re: file_get_contents gets funny characters

Post by fannnn »

AFAIK, file_get_contents() is binary safe, which means it returns the file's bytes unaltered, regardless of the encoding. You probably have a mistake in the way you're storing or echoing it after retrieving it. Show some more code.
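
To illustrate the point about unaltered bytes, here is a minimal sketch of how the garbled text can arise. It assumes PHP 7+ with the mbstring extension, and it assumes the bytes are being misread as Windows-1252 somewhere downstream (the thread does not confirm which single-byte encoding was involved). The variable names are made up for the example.

```php
<?php
// The curly apostrophe U+2019, as in "you’re".
$original = "you\u{2019}re";

// In UTF-8 that apostrophe is stored as three bytes: e2 80 99.
$apostropheHex = bin2hex("\u{2019}");

// Simulate a consumer that wrongly treats those UTF-8 bytes as
// Windows-1252 (an assumption) and converts them to UTF-8 for display.
$misread = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

echo $apostropheHex, "\n"; // e28099
echo $misread, "\n";       // you’re
```

The bytes never change; only the encoding label applied to them does, which is why the fix usually belongs wherever the data is displayed or stored rather than in file_get_contents() itself.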
alfmarius
Forum Newbie
Posts: 16
Joined: Thu Jul 10, 2008 5:14 am

Re: file_get_contents gets funny characters

Post by alfmarius »

Code:

$site = file_get_contents("http://somesite");
die($site);
It's totally raw from the webpage I'm dumping.
fannnn
Forum Newbie
Posts: 6
Joined: Tue Jun 16, 2009 10:55 am

Re: file_get_contents gets funny characters

Post by fannnn »

Try adding this before the die() call:
header('Content-type: text/plain; charset=UTF-8');
alfmarius
Forum Newbie
Posts: 16
Joined: Thu Jul 10, 2008 5:14 am

Re: file_get_contents gets funny characters

Post by alfmarius »

OK, so after a lot of trial-and-error debugging, I think I got it working. First I tried what you said, to no effect, then I screwed up and went in totally the wrong direction. Today I came to work fresh after a good night's sleep, and now I got your idea working. How odd...

Anyway, this now works:

Code:

header('Content-type: text/html; charset=UTF-8');
$site = file_get_contents("<url>");
die($site);
However, I ran get_headers('<url>') and got this:

Code:

Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Wed, 17 Jun 2009 10:22:50 GMT
    [2] => Server: Apache/2.2.3 (CentOS)
    [3] => Accept-Ranges: bytes
    [4] => Connection: close
    [5] => Content-Type: text/html; charset=UTF-8
)
It looks to me like the page already sends this header, so why it didn't work the first time around beats me! If anyone has a clue, please enlighten me :)
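
One possible explanation, offered as a sketch rather than a diagnosis: the Content-Type shown by get_headers() describes the remote server's response. file_get_contents() returns only the body, so that header is not forwarded to whoever views your script's output, and your script has to declare the charset itself. The helper below, charset_from_headers(), is a hypothetical function (not part of PHP) that reads the charset out of a header array like the one above; it assumes PHP 7.1+.

```php
<?php
// Hypothetical helper: extract the charset from raw header lines,
// such as the array returned by get_headers().
function charset_from_headers(array $headers): ?string
{
    foreach ($headers as $line) {
        // Match e.g. "Content-Type: text/html; charset=UTF-8".
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([^";\s]+)/i', $line, $m)) {
            return strtoupper($m[1]);
        }
    }
    return null; // No charset declared.
}

// Sample headers copied from the output above.
$headers = [
    'HTTP/1.1 200 OK',
    'Server: Apache/2.2.3 (CentOS)',
    'Content-Type: text/html; charset=UTF-8',
];

echo charset_from_headers($headers), "\n"; // UTF-8
```

A script could then pass the detected value on in its own header() call before echoing the fetched body, falling back to a default when the function returns null.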
alfmarius
Forum Newbie
Posts: 16
Joined: Thu Jul 10, 2008 5:14 am

Re: file_get_contents gets funny characters

Post by alfmarius »

After 2 days of debugging (and a few more grey hairs) I found out what the deal was: the communication between PHP and MySQL.
Apparently the connection character set defaults to 'latin1' when it should be 'utf8'. So the quick fix is to run this after mysql_connect:

Code:

mysql_query("SET NAMES 'utf8'");
I found a good article about this here: http://akrabat.com/2009/03/18/utf8-php-and-mysql/
So basically I don't have to mess with any headers, just set this value.
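
For anyone reading this on a newer PHP: the mysql_* functions were removed in PHP 7.0, so the same idea has to be expressed through PDO or mysqli. The following is a minimal sketch assuming the pdo_mysql extension; utf8_dsn() and connect_utf8() are hypothetical helper names, and the host, database, and credentials are placeholders. It uses utf8mb4 because MySQL's 'utf8' only covers characters of up to three bytes.

```php
<?php
// Hypothetical helper: build a PDO DSN that sets the connection charset,
// which plays the same role as running SET NAMES after connecting.
function utf8_dsn(string $host, string $db): string
{
    return "mysql:host={$host};dbname={$db};charset=utf8mb4";
}

// Hypothetical helper: open a connection using that DSN.
// Calling it requires a reachable MySQL server and valid credentials.
function connect_utf8(string $host, string $db, string $user, string $pass): PDO
{
    return new PDO(utf8_dsn($host, $db), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// Placeholder values, for illustration only.
echo utf8_dsn('localhost', 'mydb'), "\n";
```

With mysqli, the equivalent step is calling $mysqli->set_charset('utf8mb4') right after connecting.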