file_get_contents gets funny characters
Posted: Tue Jun 16, 2009 11:30 am
by alfmarius
I'm trying to dump a webpage which contains this text:
If you’re one of those
But using file_get_contents gives me this text:
If you’re one of those
The documentation says: As of PHP 6, the default encoding of the read data is UTF-8. You can specify a different encoding by creating a custom context or by changing the default using stream_default_encoding(). This flag cannot be used with FILE_BINARY.
I'm thinking the problem lies in PHP trying to read this as latin1 instead of UTF-8, or something in that direction. The suggested function stream_default_encoding() doesn't even exist! Very strange.
Does anyone have a clue?
Thanks

Re: file_get_contents gets funny characters
Posted: Tue Jun 16, 2009 11:32 am
by fannnn
AFAIK, file_get_contents() is binary safe, which means it returns the file unaltered, regardless of the encoding. You probably have a mistake in the way you're storing or echoing it after retrieving it. Show some more code.
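To illustrate the point about bytes coming back unaltered, here is a minimal sketch of how "funny characters" of this kind typically arise: the bytes are valid UTF-8, but something downstream reads them as a single-byte encoding. It assumes the mbstring extension is available and that the misreading is Windows-1252; the sample string is made up for illustration.

```php
<?php
// "you're" with a right single quotation mark (U+2019), written as raw UTF-8 bytes.
$utf8 = "you\xE2\x80\x99re";

// Misread those bytes as Windows-1252 and re-encode the result as UTF-8.
// Each of the three bytes of U+2019 becomes its own character, which is
// what a viewer shows when it assumes the wrong charset.
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

echo bin2hex($utf8), "\n"; // 796f75e280997265
echo $garbled, "\n";
```

The original string is never modified here; only the interpretation changes, which is why the fix usually belongs where the bytes are displayed or stored rather than where they are fetched.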
Re: file_get_contents gets funny characters
Posted: Tue Jun 16, 2009 11:36 am
by alfmarius
Code: Select all
$site = file_get_contents("http://somesite");
die($site);
It's totally raw from the webpage I'm dumping.
Re: file_get_contents gets funny characters
Posted: Tue Jun 16, 2009 11:38 am
by fannnn
Try adding this before the die() call:
header('Content-type: text/plain; charset=UTF-8');
Re: file_get_contents gets funny characters
Posted: Wed Jun 17, 2009 5:26 am
by alfmarius
OK, so after a lot of trial-and-error debugging, I think I got it working. First I tried what you said, to no effect, then I screwed up and went in totally the wrong direction. Today I came to work fresh after a good night's sleep, and now I've got your idea working. How odd...
Anyway, this now works:
Code: Select all
header('Content-type: text/html; charset=UTF-8');
$site = file_get_contents("<url>");
die($site);
However! I did a get_headers('<url>') and got this:
Code: Select all
Array
(
[0] => HTTP/1.1 200 OK
[1] => Date: Wed, 17 Jun 2009 10:22:50 GMT
[2] => Server: Apache/2.2.3 (CentOS)
[3] => Accept-Ranges: bytes
[4] => Connection: close
[5] => Content-Type: text/html; charset=UTF-8
)
It looks to me like this page already sends that header, so why it didn't work the first time around beats me! If anyone has a clue, please enlighten me.
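One thing worth keeping in mind: get_headers() reports the headers the remote server sent, while the dumping script's own response carries its own Content-Type, so the remote charset is not passed along automatically. A minimal sketch of carrying it over, assuming a hypothetical helper charsetFromHeaders() and a UTF-8 fallback (both are illustrative, not from the thread):

```php
<?php
// Return the charset named in the first Content-Type header line, or null if none is found.
// $headers is a list of raw header lines, in the shape get_headers() returns.
function charsetFromHeaders(array $headers): ?string
{
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset=["\']?([^"\'\s;]+)/i', $line, $m)) {
            return $m[1];
        }
    }
    return null;
}

// A few header lines copied from the get_headers() output in this post.
$remote = [
    'HTTP/1.1 200 OK',
    'Server: Apache/2.2.3 (CentOS)',
    'Content-Type: text/html; charset=UTF-8',
];

// The UTF-8 fallback is an assumption for pages that do not declare a charset.
$charset = charsetFromHeaders($remote) ?? 'UTF-8';
$contentType = 'Content-Type: text/html; charset=' . $charset;

// In the dumping script this string would go to header() before any output.
echo $contentType, "\n";
```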

Re: file_get_contents gets funny characters
Posted: Wed Jun 17, 2009 8:32 am
by alfmarius
After 2 days of debugging (and a few more grey hairs) I found out what the deal was: the communication between PHP and MySQL.
Apparently the connection defaults to 'latin1' when it should be 'utf8'. So the quick fix is to do this after mysql_connect:
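A minimal sketch of setting the connection character set, not necessarily the exact call used in this post. It swaps in the mysqli extension, since the old mysql_* functions were removed in PHP 7; the helper name connectUtf8() and its parameters are hypothetical.

```php
<?php
// Open a MySQL connection and switch its character set to UTF-8.
// Host, credentials, and database name are placeholders supplied by the caller.
function connectUtf8(string $host, string $user, string $pass, string $db): mysqli
{
    $link = mysqli_connect($host, $user, $pass, $db);
    if ($link === false) {
        throw new RuntimeException('Connect failed: ' . mysqli_connect_error());
    }
    // 'utf8' matches the value named in the post; 'utf8mb4' is needed for 4-byte characters.
    if (!mysqli_set_charset($link, 'utf8')) {
        throw new RuntimeException('Could not set charset: ' . mysqli_error($link));
    }
    return $link;
}
```

Usage would look something like connectUtf8('localhost', 'user', 'secret', 'mydb'), with all four values replaced by real ones.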
I found a good article about this here:
http://akrabat.com/2009/03/18/utf8-php-and-mysql/
So basically I don't have to mess with any headers, just set this value.