How do YOU handle Character Encoding?

Posted: Wed Jul 19, 2006 9:40 pm
by Ambush Commander
Just curious. I'm trying to figure out whether I should build character encoding support into a parser I wrote, and it's quite a knotty issue.

Posted: Thu Jul 20, 2006 12:00 am
by daedalus__
?

Posted: Thu Jul 20, 2006 1:30 am
by fastfingertips
You can set the encoding in the escaping method, so I suppose this process will be triggered when you select data from the DB. If you also have translations, then depending on how you handle them (DB or file) you may decide to move the process from the DL to the View.
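
As a rough illustration of setting the encoding in the escaping method: PHP's htmlspecialchars() takes the charset as its third argument. The helper name escape_html() is made up for this sketch, and the syntax assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the thread): escape a value for HTML output
// while telling the escaping function which encoding the bytes are in.
function escape_html(string $value, string $charset = 'UTF-8'): string
{
    // ENT_QUOTES escapes both quote styles; ENT_SUBSTITUTE swaps invalid
    // byte sequences for U+FFFD instead of returning an empty string.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, $charset);
}

echo escape_html('<b>Fish & Chips</b>'), "\n";
// prints: &lt;b&gt;Fish &amp; Chips&lt;/b&gt;
```

Calling it in the View rather than in the data layer keeps the stored data unescaped, which is one way to read the suggestion above.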

Posted: Thu Jul 20, 2006 6:02 am
by Ambush Commander
Encoding in the database is one issue, but most people just store their data in Latin-1 columns regardless of what character set it is actually in. The declared encoding only matters when you rely on the database's collation functionality (sorting and comparison). MySQL 4.0 doesn't have good Unicode support (MySQL 4.1 basically fixes all the problems), so this is what most of these people do.

However, character encoding also applies to the output and processing of data. Here are some issues:

1. Do you use Unicode? There is absolutely no reason you shouldn't be using Unicode. Read this to find out more about Unicode in general and common issues: http://www.phpwact.org/php/i18n/charsets

2. Do you explicitly declare the character set by sending header('Content-Type: text/html; charset=utf-8'); from your script? Do you also specify the http-equiv meta tag?

3. Do you assume that everything the user submits is in the correct encoding? For forms, this generally isn't a huge problem: even though virtually no one specifies accept-charset, the browser is usually smart enough to encode the submission in the same encoding as the page containing the form.

However, once you offer other ways for user data to get in, such as file uploads, you can't assume anything about its encoding. You have to figure out what the encoding is, get rid of the byte order mark (if there is one), and convert the data to UTF-8 (if it isn't that already).

4. Do you account for low-quality browsers mangling textareas that contain Unicode characters? MediaWiki fixes this by transparently converting all Unicode characters to entities when one of the troublesome browsers shows up. Do you?
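
For the upload case in point 3, the three steps (work out the encoding, drop the byte order mark, convert to UTF-8) might look roughly like this. It is only a sketch: normalise_to_utf8() is a made-up name, the Latin-1 fallback is a guess you would tune for your own users, and it needs the mbstring extension.

```php
<?php
// Sketch only: normalise_to_utf8() is a hypothetical helper, and the Latin-1
// fallback is an assumption. Requires the mbstring extension.
function normalise_to_utf8(string $bytes, string $fallback = 'ISO-8859-1'): string
{
    // 1. A byte order mark, if present, names the encoding outright.
    //    (UTF-32 BOMs are ignored here to keep the sketch short.)
    $boms = [
        "\xEF\xBB\xBF" => 'UTF-8',
        "\xFF\xFE"     => 'UTF-16LE',
        "\xFE\xFF"     => 'UTF-16BE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            $body = substr($bytes, strlen($bom)); // strip the BOM
            return mb_convert_encoding($body, 'UTF-8', $encoding);
        }
    }

    // 2. No BOM: keep the bytes as they are if they already form valid UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // 3. Otherwise guess: treat the bytes as the fallback encoding.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

// "caf\xE9" is "café" in Latin-1; the result is the UTF-8 byte sequence.
var_dump(normalise_to_utf8("caf\xE9") === "caf\xC3\xA9"); // bool(true)
```

The mb_check_encoding() test in step 2 works just as well on form fields if you would rather not take the browser's word for it.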
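
For point 4, here is a minimal sketch of the entity round trip using mbstring. This is not MediaWiki's actual code; the function names are made up, and how you decide that a browser is one of the broken ones is left open (the sketch just takes a flag).

```php
<?php
// Sketch only; requires mbstring. Function names are hypothetical.
// Conversion map: every code point from U+0080 to U+10FFFF, so plain ASCII
// is left untouched.
const NON_ASCII_MAP = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Before filling the textarea: turn non-ASCII characters into numeric
// entities, but only for browsers flagged as broken.
function textarea_encode(string $utf8, bool $browserIsBroken): string
{
    return $browserIsBroken
        ? mb_encode_numericentity($utf8, NON_ASCII_MAP, 'UTF-8')
        : $utf8;
}

// After the form comes back: turn the numeric entities into UTF-8 again.
function textarea_decode(string $submitted, bool $browserIsBroken): string
{
    return $browserIsBroken
        ? mb_decode_numericentity($submitted, NON_ASCII_MAP, 'UTF-8')
        : $submitted;
}

echo textarea_encode("caf\xC3\xA9", true), "\n"; // prints: caf&#233;
```

One caveat with this naive version: on the way back it also decodes numeric entities the user typed in deliberately, so a real implementation needs some way to tell the two apart.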