UTF-8 as encoding?
Posted: Thu Nov 18, 2004 10:13 am
by MarK (CZ)
Well, I'm not quite sure if this should go here, because it also deals with MySQL and the client side, but I think it belongs in Theory and Design. Sorry if I'm wrong. Here it is:
I have a site which is offered in several languages (English, Czech, German, French, Swedish, and others may come). Would it be better to use UTF-8 as the encoding instead of switching encodings for each language? What are the good and bad points of this solution?
Thanks

Posted: Thu Nov 18, 2004 1:45 pm
by Weirdan
Yeah, I think UTF-8 is the way to go in your situation. I really don't see any bad points there, except the fact that PHP does not support multibyte strings natively.
Posted: Thu Nov 18, 2004 2:27 pm
by MarK (CZ)
Weirdan wrote: except the fact that PHP does not support multibyte strings natively.
What does that mean? Sorry for my dumb questions.

Posted: Thu Nov 18, 2004 3:39 pm
by Weirdan
MarK (CZ) wrote: What does that mean? Sorry for my dumb questions.

For multibyte encodings (such as UTF-8) you can't rely on your favorite string functions, for example strstr, strlen, substr, str_replace and so forth, because they work on bytes rather than characters. Instead, mb_* equivalents exist (http://www.php.net/manual/en/ref.mbstring.php). Although it is possible to overload the standard string functions with their multibyte equivalents, there were rumours that the mbstring function overloading feature isn't stable enough to be used on production servers.
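
To illustrate the difference, here is a minimal sketch. It assumes the mbstring extension is loaded and that the file is saved as UTF-8; the sample word and the variable names are only examples, not anything from the site discussed here.

```php
<?php
// Assumption: mbstring is loaded and this file is saved as UTF-8.
$word = 'žluťoučký'; // Czech sample word, 9 characters

// strlen() counts bytes; ž, ť, č and ý take two bytes each in UTF-8.
$byteLength = strlen($word);

// mb_strlen() counts characters when told which encoding to use.
$charLength = mb_strlen($word, 'UTF-8');

// substr() cuts at byte offsets, so it can split a character in half.
$byteCut = substr($word, 0, 1);

// mb_substr() cuts at character offsets.
$charCut = mb_substr($word, 0, 1, 'UTF-8');

printf("bytes: %d, characters: %d\n", $byteLength, $charLength);
// prints "bytes: 13, characters: 9"
```

Here $byteCut holds only the first byte of "ž", which is not valid UTF-8 on its own, while $charCut holds the whole character.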
Posted: Thu Nov 18, 2004 3:54 pm
by MarK (CZ)
Ok, another question on this: in MySQL 4.1+, if I used UTF-8 as the encoding in the database, would the connection character set affect which charset I get back? If I used "mysql> SET NAMES 'latin2';" would I get the results as a fine 'latin2' string? I'm trying to understand this, and it has been causing me a lot of problems since I upgraded the MySQL server (yeah, I'm reading the manual, but unfortunately I still can't get into it).
Thanks for your patience

Posted: Thu Nov 18, 2004 5:41 pm
by Weirdan
MarK (CZ) wrote: Ok, another question on this: in MySQL 4.1+, if I used UTF-8 as the encoding in the database, would the connection character set affect which charset I get back?
According to their manual, yes.
If I used "mysql> SET NAMES 'latin2';" would I get the results as a fine 'latin2' string?
Yes, provided UTF-8 includes all the characters used in latin2 (I believe it does, but I can't tell you for sure, as I've never used latin2).
I'm trying to understand this, and it has been causing me a lot of problems since I upgraded the MySQL server (yeah, I'm reading the manual, but unfortunately I still can't get into it).
That's why I still use 4.0.x branch
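
The UTF-8 to latin2 conversion discussed above can be tried out on the PHP side. This is only a rough analogy using mb_convert_encoding, not MySQL's own conversion. It assumes mbstring is loaded and the file is saved as UTF-8; the sample strings are made up for illustration.

```php
<?php
// Assumption: mbstring is loaded and this file is saved as UTF-8.
// Make the replacement for unconvertible characters explicit: '?' (0x3F).
mb_substitute_character(0x3F);

$czech   = 'Příliš';  // every character exists in ISO 8859-2 (latin2)
$russian = 'Привет';  // Cyrillic, not part of ISO 8859-2

$czechLatin2   = mb_convert_encoding($czech, 'ISO-8859-2', 'UTF-8');
$russianLatin2 = mb_convert_encoding($russian, 'ISO-8859-2', 'UTF-8');

// latin2 is single-byte, so strlen() now equals the character count.
printf("%d bytes in UTF-8, %d bytes in latin2\n", strlen($czech), strlen($czechLatin2));

// Converting back restores the text only when nothing was lost.
$roundTrip = mb_convert_encoding($czechLatin2, 'UTF-8', 'ISO-8859-2');
var_dump($roundTrip === $czech);

// Characters with no latin2 equivalent come out as the substitute character.
var_dump($russianLatin2);
```

So the conversion is only lossless when every stored character also exists in the target character set.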

Posted: Mon Dec 13, 2004 12:46 pm
by MarK (CZ)
Ok, I just need more help.
My site is multi-language (you can see it here: OFP.info). It uses several character sets, and more may be added, so I need to be able to handle that.
I've decided to keep, e.g. for the news, one table holding the news for all language sections, with UTF-8 as its character set. I would then convert the strings to the appropriate encoding using "SET NAMES 'latin2';" when fetching them from the db, which should avoid the multi-byte problems.
Is this a good way? Or is it better to use UTF-8 everywhere? Or some other way? Thanks for your comments!

Posted: Tue Dec 14, 2004 7:02 am
by Weirdan
Currently you can live with latin2, as all the languages you use fit into it (not quite sure about chezh, did I spell it right?). But if you expect to add languages like Hebrew, Russian, Arabic, Ukrainian and so on, I'd suggest you switch to UTF-8.
Posted: Wed Dec 15, 2004 1:30 am
by MarK (CZ)
Yeah, Czech (it's Czech, goddamnit!) uses latin2 (ISO 8859-2). But I'm worried about the problems with UTF-8 and multi-byte character support in PHP. If I couldn't use functions like strlen(), it could be a bit of a problem for me. And changing the connection character set every time (to get the encoding I want for each language) seems stupid to me: more requests to the server. I don't know...

Posted: Wed Dec 15, 2004 3:07 am
by timvw
It's not that strlen isn't available anymore; you just need to call mb_strlen instead.
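
A minimal sketch of what that swap can look like, assuming mbstring is loaded and the file is saved as UTF-8. Setting the internal encoding once means the mb_* calls don't need an explicit encoding argument; the sample string is just an example.

```php
<?php
// Assumption: mbstring is loaded and this file is saved as UTF-8.
// Tell mbstring once which encoding the strings are in.
mb_internal_encoding('UTF-8');

$text = 'příliš';

// Character-based length, position and upper-casing.
$length   = mb_strlen($text);       // counts characters
$position = mb_strpos($text, 'l');  // character offset of "l"
$upper    = mb_strtoupper($text);   // upper-cases accented letters too

// The byte-based counterpart reports a byte offset instead.
$bytePosition = strpos($text, 'l');

printf("length %d, position %d (byte offset %d), upper %s\n",
    $length, $position, $bytePosition, $upper);
```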

Posted: Wed Dec 15, 2004 3:19 am
by Weirdan
You may try to use the option to overload standard string functions with their mb_ analogs...
Posted: Thu Dec 16, 2004 9:46 am
by MarK (CZ)
What are the limitations/problems/tricky things of this MB_ solution?