UTF-8 as encoding?

Not for 'how-to' coding questions but PHP theory instead, this forum is here for those of us who wish to learn about design aspects of programming with PHP.

Moderator: General Moderators

MarK (CZ)
Forum Contributor
Posts: 239
Joined: Tue Apr 13, 2004 12:51 am
Location: Prague (CZ) / Vienna (A)

Post by MarK (CZ) »

Well, I'm not quite sure this belongs here because it also deals with MySQL and the client side, but I think it fits Theory and Design best. Sorry if I'm wrong. Here it is:

I have a site that is offered in several languages (English, Czech, German, French, Swedish, and others may come). Would it be better to use UTF-8 as the encoding everywhere instead of switching encodings for each language? What are the good and bad points of this solution?

Thanks :)
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

Yeah, I think UTF-8 is the way to go in your situation. I really don't see any bad points there, except the fact that PHP does not support multibyte strings natively.

Post by MarK (CZ) »

Weirdan wrote: except the fact that PHP does not support multibyte strings natively.
What does that mean? Sorry for my dumb questions :D

Post by Weirdan »

MarK (CZ) wrote: What does that mean? Sorry for my dumb questions :D
With multibyte encodings (such as UTF-8) you can't rely on your favorite string functions, for example strstr, strlen, substr, str_replace and so forth, because they work on bytes rather than characters. Instead, the mbstring extension provides mb_* equivalents for most of them ( http://www.php.net/manual/en/ref.mbstring.php ). Although it is possible to overload the standard string functions with their multibyte equivalents, there were rumours that the mbstring function overloading feature isn't stable enough to be used on production servers.
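
To make the byte-versus-character difference concrete, here is a minimal sketch, assuming the mbstring extension is loaded. The sample word and the variable names are only illustrative, and the string is written with hex escapes so that it is UTF-8 no matter how the file itself is saved.

```php
<?php
// Sketch only: assumes the mbstring extension is loaded.
// "Příliš" (Czech) as UTF-8 bytes: ř, í and š take two bytes each.
$word = "P\xC5\x99\xC3\xADli\xC5\xA1";

$bytes = strlen($word);              // counts bytes
$chars = mb_strlen($word, 'UTF-8');  // counts characters

$byteCut = substr($word, 0, 2);             // "P" plus the first byte of "ř"
$charCut = mb_substr($word, 0, 2, 'UTF-8'); // "P" plus the whole "ř"

printf("bytes: %d, characters: %d\n", $bytes, $chars);

// Check whether each cut is still valid UTF-8.
var_dump(mb_check_encoding($byteCut, 'UTF-8'));
var_dump(mb_check_encoding($charCut, 'UTF-8'));
```

Here strlen reports 9 while mb_strlen reports 6, and the substr cut splits "ř" in the middle, so the result is no longer valid UTF-8.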

Post by MarK (CZ) »

OK, another question on this: in MySQL 4.1+, if I used UTF-8 as the encoding in the database, would setting the connection character set affect which charset I get back? If I used "mysql> SET NAMES 'latin2';", would I get the results as a proper 'latin2' string? I'm trying to understand this, and it has been causing me a lot of problems since I upgraded the MySQL server (yeah, I'm reading the manual ;) but unfortunately I still can't get my head around it).

Thanks for your patience :)

Post by Weirdan »

MarK (CZ) wrote: OK, another question on this: in MySQL 4.1+, if I used UTF-8 as the encoding in the database, would setting the connection character set affect which charset I get back?
According to their manual, yes.
If I used "mysql> SET NAMES 'latin2';", would I get the results as a proper 'latin2' string?
Yes, as long as the text you stored only uses characters that exist in latin2. UTF-8 itself covers every latin2 character, so the conversion only loses something for characters outside latin2 (I can't tell you more than that, as I have never used latin2).
I'm trying to understand this, and it has been causing me a lot of problems since I upgraded the MySQL server (yeah, I'm reading the manual ;) but unfortunately I still can't get my head around it).
That's why I still use the 4.0.x branch :)

Post by MarK (CZ) »

Ok, I just need more help :(

My site is multi-language (you can see it here: OFP.info). It uses several character sets, and more may come, so I need to be able to handle that.

I've decided to keep, e.g., the news for all language sections in one table that uses UTF-8 as its character set, and then convert the strings to the appropriate encoding with "SET NAMES 'latin2';" when fetching them from the db. That should avoid the multi-byte problems.

Is this a good way? Or is it better to use UTF-8 everywhere? Or some other way? Thanks for your comments! :D

Post by Weirdan »

For now you can live with latin2 as long as all the characters you actually use fit into it (not quite sure about chezh, did I spell it right? ). Check your French and Swedish text carefully, though, because latin2 lacks some letters, such as the Swedish å. And if you expect to add languages like Hebrew, Russian, Arabic, Ukrainian and so on, I'd suggest you switch to UTF-8.
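
One way to check whether a given piece of text survives a conversion to latin2 is a round trip through ISO-8859-2. The following is a rough sketch, assuming the mbstring extension is loaded; fitsLatin2() is a hypothetical helper written for this post, not part of PHP or MySQL, and it only approximates what a database-side conversion would do.

```php
<?php
// Sketch only: assumes the mbstring extension is loaded.
// fitsLatin2() is a hypothetical helper, not a PHP or MySQL function.
function fitsLatin2(string $utf8): bool
{
    // Convert to latin2 and back; characters that latin2 cannot
    // represent do not survive the round trip unchanged.
    $latin2 = mb_convert_encoding($utf8, 'ISO-8859-2', 'UTF-8');
    $back = mb_convert_encoding($latin2, 'UTF-8', 'ISO-8859-2');

    return $back === $utf8;
}

$samples = [
    'czech'   => "P\xC5\x99\xC3\xADli\xC5\xA1", // "Příliš"
    'swedish' => "\xC3\xA5r",                   // "år"
    'russian' => "\xD0\x9C\xD0\xB8\xD1\x80",    // "Мир"
];

foreach ($samples as $label => $text) {
    printf("%s fits latin2: %s\n", $label, fitsLatin2($text) ? 'yes' : 'no');
}
```

The Czech sample fits, while the Swedish and Russian samples do not, which is the case where a single UTF-8 encoding for the whole site pays off.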

Post by MarK (CZ) »

Yeah, Czech (it's Czech, goddamnit! :D ) uses latin2 (ISO 8859-2). But I'm worried about the problems with UTF-8 and multi-byte character support in PHP. If I couldn't use functions like StrLen(), it could be a bit of a problem for me. And changing the connection character set every time (to get the encoding I want for each language) seems silly to me, since it means more requests to the server. I don't know...

:(
timvw
DevNet Master
Posts: 4897
Joined: Mon Jan 19, 2004 11:11 pm
Location: Leuven, Belgium

Post by timvw »

It's not that strlen is no longer available... you just need to call mb_strlen instead when you want the number of characters ;)
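
As a rough illustration of what switching to the mb_ functions can look like (assuming the mbstring extension is loaded; the variable names are made up for this example), you can set the internal encoding once and then drop the encoding argument from each call:

```php
<?php
// Sketch only: assumes the mbstring extension is loaded.
mb_internal_encoding('UTF-8'); // default encoding for the mb_* calls below

$word = "p\xC5\x99\xC3\xADli\xC5\xA1"; // "příliš" as UTF-8 bytes

$length  = mb_strlen($word);        // number of characters
$upper   = mb_strtoupper($word);    // upper-cases the accented letters too
$charPos = mb_strpos($word, 'li');  // offset counted in characters
$bytePos = strpos($word, 'li');     // offset counted in bytes

printf("length: %d, char offset: %d, byte offset: %d\n", $length, $charPos, $bytePos);
echo $upper, "\n";
```

The thing to watch is mixing the two families: an offset returned by strpos is a byte offset and should not be passed to mb_substr, and the other way round.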

Post by Weirdan »

You could also try the mbstring option that overloads the standard string functions with their mb_ equivalents (the mbstring.func_overload setting)...

Post by MarK (CZ) »

What are the limitations, problems, or tricky parts of this mb_ solution?