Non-american characters not displayed
Posted: Wed Jul 30, 2008 8:19 am
by shiznatix
This seems to be a bit noobish of me but I can't figure it out. Basically, I have characters like ä and € in a string that is stored in my database. The field is a varchar(255) latin1_swedish_ci. The string is this:
en jakaa €32,000 edestä lippuja Pokerin SM-Kisoihi
which is some gibberish in Finnish. For some reason in FF, the euro sign and the ä are being displayed as the dreaded question mark things. If I just echo out the string, not from the variable, then all is well.
The problem is that I am getting all this data from an RSS2 feed via file_get_contents(). If I go directly to the feed URL then it shows the symbols perfectly. Something about grabbing the data from my script screws up the encoding. I checked the manual about file_get_contents() and found nothing about encoding, so now I ask you guys: what the heck is going on here and how do I fix it?
Re: Non-american characters not displayed
Posted: Wed Jul 30, 2008 8:45 am
by ghurtado
Did you try htmlentities()?
Re: Non-american characters not displayed
Posted: Wed Jul 30, 2008 9:52 am
by shiznatix
That somehow fixed the ä but the euro sign is still a no go.
Re: Non-american characters not displayed
Posted: Wed Jul 30, 2008 10:08 am
by dml
It's not so much that the encoding gets screwed up, it's that there's a point where bytes are getting transmitted in encoding X, but the recipient is expecting encoding Y, so the recipient interprets the bytes as a string that the sender didn't intend. So it's a question of isolating where this is happening.
There's a sequence of data transmissions. What is it: download from RSS, insert into database, retrieve from database, send to browser? And what's the encoding of each of those data transmissions?
RSS Server->Application // ???
Application->Mysql // Latin1?
Mysql->Application // Latin1?
Application->Browser // ???
Whenever you receive some bytes, you should be able to make an assertion that these bytes spell out a given sequence of characters in a given encoding. If they don't spell out what you expect, then something has gone wrong upstream. I'm still looking for a really convenient method for making those assertions. One way is to run bin2hex on the string and make sure the Euro symbol is encoded correctly (0x80 in MySQL's latin1, 0xe282ac in UTF-8). Another way is to use the recode utility, or its associated PHP extension, to dump out a spelling of the string in the encoding you expect it to be in.
Code:
$s = "€32,000 edestä";
// spell out the string, assuming it's utf8-encoded text
// recode_string() returns the dump rather than printing it, so echo the result
echo recode_string("utf8..dump-with-names", $s);
/*
prints to STDOUT:
20AC Eu symbole euro
0033 3 digit three
0032 2 digit two
002C , comma
0030 0 digit zero
0030 0 digit zero
0030 0 digit zero
0020 SP space
0065 e latin small letter e
0064 d latin small letter d
0065 e latin small letter e
0073 s latin small letter s
0074 t latin small letter t
00E4 a: latin small letter a with diaeresis
*/
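Where the recode extension isn't available, the bin2hex check described above can be sketched roughly like this. The helper name guess_euro_encoding() is made up for illustration, and it is only a heuristic: it looks for the Euro sign's bytes and nothing else.

```php
<?php
// Hypothetical helper (not a PHP built-in): guess how the Euro sign in a
// byte string is encoded. Heuristic only, it checks nothing but the Euro sign.
function guess_euro_encoding($bytes)
{
    // preg_match with the /u flag returns 1 only if $bytes is valid UTF-8
    $isValidUtf8 = preg_match('//u', $bytes) === 1;

    if ($isValidUtf8 && strpos($bytes, "\xE2\x82\xAC") !== false) {
        return 'utf-8';   // Euro sign as the three-byte UTF-8 sequence
    }
    if (!$isValidUtf8 && strpos($bytes, "\x80") !== false) {
        return 'cp1252';  // a lone 0x80 byte is the Euro sign in cp1252
    }
    return 'unknown';
}

// "€32,000 edestä" written out as cp1252 bytes
$fromDb = "\x8032,000 edest\xe4";

echo bin2hex($fromDb), "\n";             // hex dump to inspect by eye
echo guess_euro_encoding($fromDb), "\n"; // which encoding the Euro sign suggests
```

Running the same check at each hop (after the download, after the database read, just before output) should show where the bytes stop matching the encoding you expected.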
Re: Non-american characters not displayed
Posted: Tue Aug 05, 2008 6:47 am
by shiznatix
For some reason I don't have the recode_string() function. I'm not too worried about that, but I am still having problems. If I bin2hex() the string I get this:
656e206a616b6161208033322c303030206564657374e4206c697070756a6120506f6b6572696e20534d2d4b69736f696869
but when I use this website, http://www.string-functions.com/hex-string.aspx, it comes out properly. Why does it work on that website but not on mine? The total string I am using is:
Code:
en jakaa €32,000 edestä lippuja Pokerin SM-Kisoihi
Re: Non-american characters not displayed
Posted: Thu Aug 07, 2008 1:53 am
by shiznatix
**bump
Re: Non-american characters not displayed
Posted: Thu Aug 07, 2008 11:53 am
by dml
The "€" in the string is encoded as "\x80", so it's cp1252. Mysql's "latin1" encoding is actually cp1252 and not iso-8859-1. These are mostly the same (ä is "\xe4" in both), but the Euro sign is one of the differences - iso-8859-1 has no Euro sign at all, and it's "\xa4" in iso-8859-15.
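A quick way to see those byte differences for yourself, as a sketch that assumes the iconv extension is installed and knows the charset names CP1252 and ISO-8859-15:

```php
<?php
// Characters written as UTF-8 byte escapes, so the encoding this file is
// saved in doesn't affect the result.
$euro = "\xE2\x82\xAC"; // U+20AC EURO SIGN
$aUml = "\xC3\xA4";     // U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

foreach (array('CP1252', 'ISO-8859-15') as $charset) {
    printf(
        "%-12s euro=%s a-umlaut=%s\n",
        $charset,
        bin2hex(iconv('UTF-8', $charset, $euro)),
        bin2hex(iconv('UTF-8', $charset, $aUml))
    );
}
// Expected: euro is 80 in CP1252 and a4 in ISO-8859-15; a-umlaut is e4 in both.
```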
Assuming that the data is being fed from mysql to a web page, you can either change the header of the web page to suit the encoding (so "Content-Type: text/html; charset=cp1252"), or change the encoding to suit the header. For example, if you set the "character_set_results" variable, mysql will send out results in the encoding you want.
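The "change the encoding to suit the header" route can also be done on the PHP side instead of in mysql. A minimal sketch, assuming the iconv extension is available; cp1252_to_utf8() is a made-up helper name, and the sample string stands in for whatever comes back from the database:

```php
<?php
// Hypothetical helper: re-encode cp1252 bytes (what mysql calls latin1) as UTF-8.
function cp1252_to_utf8($bytes)
{
    return iconv('CP1252', 'UTF-8', $bytes);
}

// Stand-in for a value read from the latin1 column
$fromDb = "\x8032,000 edest\xe4";
$utf8   = cp1252_to_utf8($fromDb);

// Tell the browser which encoding is actually being sent.
// Skipped on the command line, where there is no HTTP response.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo $utf8, "\n";
```

On the mysql side, issuing "SET NAMES utf8" on the connection is one way to set character_set_results (along with the client and connection character sets), in which case the conversion above would not be needed.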
The other option is to use entities. It looks like htmlentities() doesn't convert the Euro sign - it doesn't seem to be included in get_html_translation_table(). A solution for the Euro sign is to str_replace "\x80" with "&euro;". In order to be fully correct, you'll have to do the same for any character in the data whose cp1252 encoding is different from the encoding declared in the header, so it's better just to get the encoding right.
Re: Non-american characters not displayed
Posted: Thu Aug 07, 2008 12:08 pm
by ghurtado
Shouldn't htmlentities() always be used before displaying any data from MySQL in a web page, regardless of the underlying charset problems?
Re: Non-american characters not displayed
Posted: Thu Aug 07, 2008 12:48 pm
by dml
When you say htmlentities should always be used, do you mean from a security point of view or for another reason?
From a security point of view, htmlspecialchars should be sufficient to prevent XSS type attacks. Of course, security policies shouldn't be based on assertions of the type "measure X should be sufficient to prevent attack Y", and it probably gives some sort of defense in depth to turn the extra characters into entities, but I wouldn't be able to say what extra protection it gives. It's not like htmlentities magically solves any character problem: from what I can see, it gives programmers in the English speaking world access to a smattering of accented characters and mathematical symbols without having to work out what encoding to output them in.
Actually, it looks like htmlentities will solve the problem, if the encoding is given as an argument:
Code:
// prints out "&euro;"
echo htmlentities("\x80", ENT_QUOTES, "cp1252");
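Applied to the string from the original question, the two entity approaches compare roughly as follows. This is a sketch: the sample bytes are a shortened cp1252 version of the string, and the outputs in the comments assume PHP's default HTML 4.01 entity table.

```php
<?php
// Shortened sample of the original string, written as cp1252 bytes
$fromDb = "en jakaa \x8032,000 edest\xe4 lippuja";

// Option 1: patch only the Euro sign by hand. The ä is left as a raw
// 0xE4 byte, so it still depends on the charset the page declares.
$patched = str_replace("\x80", '&euro;', $fromDb);
echo bin2hex($patched), "\n";

// Option 2: tell htmlentities() the input is cp1252, so every character
// that has a named entity is converted.
$entities = htmlentities($fromDb, ENT_QUOTES, 'cp1252');
echo $entities, "\n"; // en jakaa &euro;32,000 edest&auml; lippuja
```

Since the second form leaves only ASCII in the output, it displays the same whatever charset the page declares.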
Re: Non-american characters not displayed
Posted: Thu Aug 07, 2008 2:27 pm
by ghurtado
dml wrote: When you say htmlentities should always be used, do you mean from a security point of view or for another reason?
To be honest, I wasn't thinking of any specific reason. I have always used it, probably half out of habit, half out of being a Spaniard in a country full of American keyboards.
Thank you for your insight. I suppose security is one good reason to use HTML entities. I always assumed there was some sort of W3C HTML standard that covered the usage of these entities, and that there was some sort of higher purpose to the function.

Re: Non-american characters not displayed
Posted: Fri Aug 08, 2008 5:41 am
by shiznatix
dml, they worked just great. Thanks a billion