What role does the current locale, as set by setlocale(), play in converting a string from one character encoding to another?
In the documentation notes for iconv(), there are some examples of this.
I'm struggling to understand why the locale should ever matter when converting a string from one known encoding to another. Can anyone shine some light on this for me?
My guess is that iconv() only consults the locale when //TRANSLIT is used. Is this true? If so, can anyone offer some specific iconv() examples that might help me understand why?
Thank you!
Locale and character encoding conversion with iconv()
Re: Locale and character encoding conversion with iconv()
Here's an example of where users in different locales expect different transliterations of the same character.
I'd like to see other examples of transliterations if anyone can find them. It's fascinating when what looks like a technical issue turns out to be a cultural one once you look more deeply into it. Somebody here mentions differing transliterations of Cyrillic to Roman. I've tried running "Чехов" through iconv() to see whether it transliterates to "Chekov" in an English locale and "Tchekov" in a French locale, but I haven't gotten it to work.
From Å:
Since Å is a letter with a distinct sound, not an A with an accent, it is best to keep it when referring to Scandinavian words and names in other languages. However, in Danish and Norwegian, Aa is widely known as the old way of writing Å, used until the first part of the 20th century, and a fully functional transcription for Å when using a foreign keyboard. Due to technical troubles with the Å, it is also mostly spelled Aa in internet addresses. In Swedish, where this transcription is less common, Å is often rendered simply as A in internet addresses (internationalized domain names are still fairly uncommon).
Here's some test code. When I run it, only the Danish locale transliterates å to "aa"; the Swedish and Norwegian locales transliterate it to "a".
Code:
// Abort with a dump of both values unless $got is strictly equal to $expected.
function assert_equals($expected, $got) {
    if ($got !== $expected) {
        var_dump("EXPECT EQUALS FAILED", $expected, $got);
        die();
    }
}

// Switch LC_CTYPE to $locale, then transliterate $string from UTF-8 to ASCII.
function test($locale, $string, $expected) {
    // setlocale() returns the new locale name, or false if it is not installed.
    assert_equals($locale, setlocale(LC_CTYPE, $locale));
    assert_equals($expected, iconv('UTF-8', 'ASCII//TRANSLIT', $string));
}

$letter = "\xc3\xa5"; // å encoded as UTF-8
test("C", $letter, "?");
test("da_DK", $letter, "aa"); // Danish: å->aa
test("sv_SE", $letter, "a");  // Swedish: å->a
test("no_NO", $letter, "a");  // Norwegian: å->a
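A note for anyone trying this elsewhere: locale names are platform-specific, and setlocale() returns false when a name isn't installed, so the strict check in test() can fail before iconv() is ever called. Here is a minimal sketch of a more forgiving lookup. The helper name pick_locale() and the candidate spellings are my own assumptions, not anything from the PHP manual, and the transliteration you get still depends on the iconv implementation underneath.

```php
<?php
// Hypothetical helper: try several spellings of a locale name and return
// the one that setlocale() accepted, or null when none of them is installed.
function pick_locale(array $candidates): ?string {
    // setlocale() accepts an array and applies the first name that works.
    $chosen = setlocale(LC_CTYPE, $candidates);
    return $chosen === false ? null : $chosen;
}

// Candidate spellings are assumptions; adjust them for your platform.
$danish = pick_locale(['da_DK.UTF-8', 'da_DK', 'Danish_Denmark']);
if ($danish === null) {
    echo "No Danish locale installed; skipping\n";
} else {
    // Print whatever this system's iconv produces for å under that locale.
    echo $danish, ': ', iconv('UTF-8', 'ASCII//TRANSLIT', "\xc3\xa5"), "\n";
}
```

Passing the candidates as an array lets setlocale() do the fallback itself, which keeps the test from dying just because one particular spelling is unknown.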
Re: Locale and character encoding conversion with iconv()
Thank you. That actually makes sense. I wasn't able to reproduce your transliteration results with any of the locales on my Windows XP system (using your code), but the concept makes sense.
If you're curious, on my system "å" becomes "a" in all the locales I tried, including "Danish_Denmark", "Swedish_Sweden", and "Norwegian (Bokmål)_Norway".
A related question: Is there any need for transliteration when the output encoding is UTF-8? I assume that because UTF-8 can properly represent any Unicode character, there is no need for transliteration, and therefore no need to be locale-aware. True?
Re: Locale and character encoding conversion with iconv()
Like you, the only place I can think of at the moment where iconv() might be locale-dependent is in transliteration to character sets that, unlike UTF-8, cannot represent every character in the source string. I don't know for sure whether there are other cases; it would be interesting to know.
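To make that concrete, here is a rough sketch of how one might check whether a string fits in a target character set at all, which is the only situation where transliteration (and so the locale) would come into play. The helper name is_representable() is hypothetical. It uses a round trip rather than relying on iconv() returning false, because I'm assuming that not every iconv implementation reports unconvertible characters the same way.

```php
<?php
// Hypothetical helper: true when the UTF-8 string $utf8 survives a round
// trip through $charset unchanged, i.e. no transliteration would be needed.
function is_representable(string $utf8, string $charset): bool {
    // @ suppresses the notice iconv() raises on unconvertible input.
    $encoded = @iconv('UTF-8', $charset, $utf8);
    if ($encoded === false) {
        return false;
    }
    // Some implementations substitute a placeholder instead of failing,
    // so compare the decoded result against the original string.
    return @iconv($charset, 'UTF-8', $encoded) === $utf8;
}

var_dump(is_representable("\xc3\xa5", 'ISO-8859-1')); // å exists in ISO-8859-1
var_dump(is_representable("\xc3\xa5", 'ASCII'));      // å does not exist in ASCII
var_dump(is_representable("abc", 'ASCII'));           // plain ASCII text
```

When the check passes, the conversion is a plain one-to-one mapping and the locale has nothing to decide.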
Re: Locale and character encoding conversion with iconv()
I'll move ahead with the understanding that transcoding a string to UTF-8 with iconv("...", "UTF-8", ...) is not locale-dependent (with or without "//TRANSLIT"), but I hope there isn't a caveat to this that I don't fully understand yet. Thank you for your clear explanation of why the locale should be consulted for transliteration.
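For what it's worth, that understanding is easy to spot-check. The sketch below converts the same ISO-8859-1 byte to UTF-8 under several locales and prints the resulting bytes. The helper name convert_under_locales() and the list of locales are assumptions of mine, and locales that aren't installed are simply skipped, so this only demonstrates the behaviour on whatever system runs it.

```php
<?php
// Hypothetical helper: convert $input from $from to UTF-8 under each locale
// that is installed, and return the results keyed by locale name.
function convert_under_locales(string $input, string $from, array $locales): array {
    $results = [];
    foreach ($locales as $locale) {
        // setlocale() returns false for a locale that is not installed.
        if (setlocale(LC_CTYPE, $locale) === false) {
            continue;
        }
        $results[$locale] = iconv($from, 'UTF-8', $input);
    }
    return $results;
}

$latin1 = "\xE5"; // å encoded as ISO-8859-1
$results = convert_under_locales($latin1, 'ISO-8859-1', ['C', 'da_DK', 'sv_SE']);
foreach ($results as $locale => $utf8) {
    // Show the raw bytes so the outputs can be compared exactly.
    echo $locale, ': ', bin2hex($utf8), "\n";
}
```

Every locale that is available should print the same bytes, since UTF-8 has a code point for å and nothing needs to be approximated.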
I propound that character encoding is the single most misunderstood subject among web application developers. Character encoding itself is a simple matter, but when it comes to practical application and building internationalized web sites, there are too many opportunities to get something wrong. With data being exchanged between so many resources (user input, web services, databases, flat files, etc.), it's far too easy to accept a datum from somewhere without considering how it is encoded. Also, if you are writing data to a database, you need to be cognizant of how your RDBMS is handling the data, and also the encoding being used for the RDBMS connection itself.
For anyone who's interested, here are some references I've found helpful:
Character Sets / Character Encoding Issues
Handling UTF-8 with PHP
UTF-8: The Secret of Character Encoding
MySQL 5.0 Reference Manual :: 9.1.4 Connection Character Sets and Collations
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Re: Locale and character encoding conversion with iconv()
I agree - it's as essential to know the encoding of a string as it is to know whether $weight is pounds or kilos, or whether $price is dollars or euros. And it is far too easy to get it wrong - on Aug 1 you restore a database from backup, and on Sep 1, a user reports that their quotation marks are showing up as question marks, and it's not obvious that the two events may be connected.
Those links are very useful. The Spolsky article is a good place to start for "why" knowledge, and the link at the top has concrete PHP-specific "how" knowledge.
Re: Locale and character encoding conversion with iconv()
This site has some handy character collation charts, including charts of many locale-specific collations:
http://www.collation-charts.org/
It also provides links to some Unicode charts.
If anyone wants to examine the differences between the character transliterations of different locales, this might be a place to look.