It's not so much that the encoding gets screwed up; it's that at some point bytes are transmitted in encoding X while the recipient is expecting encoding Y, so the recipient interprets the bytes as a string that the sender didn't intend. So it's a question of isolating where this is happening.
There's a sequence of data transmissions. What is it? Download from RSS, insert into database, retrieve from database, send to browser? And what's the encoding of each of those data transmissions?
RSS Server -> Application // ???
Application -> MySQL // Latin1?
MySQL -> Application // Latin1?
Application -> Browser // ???
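To make the X-versus-Y mismatch concrete, here is a minimal sketch of what a recipient ends up with when it assumes the wrong encoding. It assumes the mbstring extension is loaded, and the variable names are just for illustration. The bytes are written as hex escapes so the result doesn't depend on how the script file itself is saved.

```php
<?php
// The sender's bytes: "€" followed by "ä", encoded as UTF-8.
$sent = "\xE2\x82\xAC\xC3\xA4";

// A recipient that wrongly treats those bytes as Windows-1252 and converts
// them to UTF-8 for display gets five different characters instead of two.
$misread = mb_convert_encoding($sent, 'UTF-8', 'Windows-1252');

echo bin2hex($sent), "\n"; // e282acc3a4
echo $misread, "\n";       // shows up as mojibake on a UTF-8 terminal
```

No bytes were damaged in transit here; only the assumption about them changed. That is why the fix is to find the hop where sender and recipient disagree.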
Whenever you receive some bytes, you should be able to make an assertion that these bytes spell out a given sequence of characters in a given encoding. If they don't spell out what you expect, then something has gone wrong upstream. I'm still looking for a really convenient method for making those assertions. One way is to run bin2hex on the string and make sure the Euro symbol is encoded correctly: 0x80 in what MySQL calls latin1 (which is really Windows-1252; strict ISO-8859-1 has no Euro symbol at all), or 0xE282AC in UTF-8. Another way is to use the recode utility, or its associated PHP extension, to dump out a spelling of the string in the encoding you expect it to be in.
Code:
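// NOTE: this literal only holds UTF-8 bytes if this source file is itself saved as UTF-8
$s = "€32,000 edestä";
// spell out the string, assuming it's UTF-8-encoded text
// recode_string() returns the result rather than printing it, so echo it
echo recode_string("utf8..dump-with-names", $s);
/*
prints: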
20AC Eu symbole euro
0033 3 digit three
0032 2 digit two
002C , comma
0030 0 digit zero
0030 0 digit zero
0030 0 digit zero
0020 SP space
0065 e latin small letter e
0064 d latin small letter d
0065 e latin small letter e
0073 s latin small letter s
0074 t latin small letter t
00E4 a: latin small letter a with diaeresis
*/
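The bin2hex check mentioned above can be sketched the same way. This is only a rough illustration: `leading_euro_encoding()` is a hypothetical helper, not a built-in, and it assumes you already know the string is supposed to start with a Euro symbol, as the price in the example does.

```php
<?php
// Hypothetical helper: reports how a string that should start with a
// Euro symbol has actually encoded it.
function leading_euro_encoding(string $bytes): string
{
    if (strncmp($bytes, "\xE2\x82\xAC", 3) === 0) {
        return 'UTF-8';        // e2 82 ac
    }
    if (strncmp($bytes, "\x80", 1) === 0) {
        return 'Windows-1252'; // 80, which is what MySQL's latin1 uses
    }
    // Neither spelling matched: show the leading bytes for inspection.
    return 'unknown (' . bin2hex(substr($bytes, 0, 3)) . ')';
}

// Sample inputs written as hex escapes so the file's own encoding is irrelevant.
$utf8Price   = "\xE2\x82\xAC" . "32,000";
$cp1252Price = "\x80" . "32,000";

echo leading_euro_encoding($utf8Price), "\n";   // UTF-8
echo leading_euro_encoding($cp1252Price), "\n"; // Windows-1252
```

Running a check like this at each hop (after the RSS download, after the database read, just before output) narrows down where the bytes stop matching the encoding you expected.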