Page 1 of 2

Unicode - do I/don't I??

Posted: Tue Jan 09, 2007 8:06 am
by georgeoc
Hi,

I'm writing an Open-Source app that will hopefully be used by people on a wide variety of hosting configurations. Hence, a database abstraction library seemed to be necessary. I chose AdoDB after some research, but it wouldn't be disastrous at this stage to change to another library, as my code isn't particularly coupled to it.

My app deals with messaging - it's a kind of hub for receiving and distributing messages between any number of formats - bulletin boards, email, RSS, NNTP, PDF, etc. I have been thinking for the last few days that perhaps I should address the issue of character encodings as soon as possible, so that I don't have to do it retrospectively when people start asking me why they get funny ??? marks in their emails.

UTF-8 seems the way to go unless I'm mistaken. I've had a look at this nice article, and have also found an interesting library from Harry Fuecks which may help.

There are a few points which worry me:
  • From the first article:
    PHP's internal encoding should be the same the one in which the PHP files are saved in. To set the encoding, call mb_internal_encoding at the very beginning of your script:
    mb_internal_encoding('UTF-8');
    Is there any way I can do the same thing but cater for a larger audience (i.e. those who don't have the mbstring extension installed)?
  • I don't see much, if anything, on setting the various stages of encoding and collation with AdoDB. If it is possible to combine AdoDB and UTF-8 successfully, please tell me what I need to do. Otherwise, what's my next option? One of the reasons I chose AdoDB was because of the XML schema library for creating and updating tables. To a relative SQL novice like me, it seemed attractive for my aims of greater portability and preventing bugs. So, my question I guess is if there is a good DB abstraction library which can support full UFT-8 (connections, queries, result sets, etc), and has the XML schema creation that a lot seem to boast? I looked at PEAR::MDB2 too (which also has the XML facility), but as far as I can tell the charset functions haven't been fully implemented yet. The library must be PHP 4 and 5 compatible.
  • In order to support a multitude of languages without encoding problems, am I right to go with UFT-8 now? Or is there a simpler solution? Or should I forget about it for now, get the app written (a few months away at least), and address encoding later, perhaps when MDB2 is fully Charset-aware?

Thanks for any advice you can give!

Re: Unicode - do I/don't I??

Posted: Tue Jan 09, 2007 8:18 am
by georgeoc
georgeoc wrote:From the first article:
PHP's internal encoding should be the same the one in which the PHP files are saved in. To set the encoding, call mb_internal_encoding at the very beginning of your script:
mb_internal_encoding('UTF-8');
Is there any way I can do the same thing but cater for a larger audience (i.e. those who don't have the mbstring extension installed)?
Or perhaps I just make it a requirement that users who want full language support must switch on the UTF-8 option in the config file, and must use a server which includes the mbstring extension. How about that?

Posted: Tue Jan 09, 2007 8:33 am
by feyd
It's not unreasonable.

Posted: Tue Jan 09, 2007 8:42 am
by georgeoc
Thanks - perhaps I'll do that then. How about my DB abstraction problem?

Posted: Tue Jan 09, 2007 8:45 am
by feyd
georgeoc wrote:Thanks - perhaps I'll do that then. How about my DB abstraction problem?
I'm not really seeing an abstraction problem. The abstraction code should have little to no impact on the database's operational handling of multilingual data. It's purely up to the database to support the encodings.

Posted: Tue Jan 09, 2007 8:56 am
by CoderGoblin
Just for your information... The major problem I had with postgres was ensuring the tablespaces were correctly stored as unicode. Once this was done I had no problems with ADODB.

Posted: Tue Jan 09, 2007 10:30 am
by georgeoc
feyd wrote:I'm not really seeing an abstraction problem. The abstraction code should have little to no impact on the database's operational handling of multilingual data. It's purely up to the database to support the encodings.
Sure, but I'd love not to have to learn the correct SQL commands to set the encoding for MySQL, Postgres, MSSQL, etc, etc. It would require extensive testing and research, which I know other people have done already. I just want to do $db->setEncoding('utf-8'); when I open the connection, and that's it if possible. Also, I want the tables to be created automatically from XML files during the installation process - I don't want to send any SQL commands at all until I start adding and retrieving data.

Posted: Tue Jan 09, 2007 10:43 am
by feyd
I could easily be wrong here, but I believe many databases change the encoding setting at various places. Particularly at the table level, database level and even server level. I don't recall the need to set it via a command sent to the server.

Posted: Tue Jan 09, 2007 10:50 am
by georgeoc
If I want to store data in a UTF-8 table, surely I need to specify that when I connect to the db and create the tables? From this post on the MDB2 developer's blog it seems like different commands need to be sent for each DBMS - there is no standard.

Posted: Tue Jan 09, 2007 3:27 pm
by Ambush Commander
Right. Sometimes, you have to trick the database: MySQL 4.0 doesn't support Unicode so you have to store as Latin text. It'll come out alright, but the collating will be messed up.

I haven't used AdoDB for a while, but I don't think it'll handle encoding for you. For a project I did a while back I had to fudge it with:

Code: Select all

$mysql_version = @mysql_get_server_info(); //may fail if not using mysql
        if ($mysql_version && strpos($mysql_version, '4.1') === 0) {
            //tell the MySQL 4.1 database we are sending utf-8 information
            $this->DB->Execute('SET NAMES utf8');
        }
Probably not a good idea, in retrospect.

Posted: Tue Jan 09, 2007 4:04 pm
by Maugrim_The_Reaper
ADOdb has no character encoding/collation methods - you actually do need to fudge it since DBs differ widely on this front. Not pretty in the slightest... It's easier if you focus on one DBMS like MySQL, but thats not always avoidable (coming from someone who generally prefers PostgreSQL as standard).
Or perhaps I just make it a requirement that users who want full language support must switch on the UTF-8 option in the config file, and must use a server which includes the mbstring extension. How about that?
Not required. The problem arises where PHP files are encoded differently than the set default. It causes internal parsing errors, creating not so much errors as corrupted output a lot of the time. I remember getting tons of odd output issues when a colleague added Simplified Chinese comments without thinking - it was funny the amount of time we spent hunting for that comment :).

The PHP source should be kept as closely as possible to ASCII for comments - which is where the problem most likely will arise if people comment in native languages outside ASCII and UTF-8. Multibyte encoding in strings is generally safe - at least I have whole sections of UTF-8 test strings in PHP files (which are UTF-8 encoded in my editor) and I've never needed to fiddle with php.ini or use the mbstring extension unless the strings are actually being manipulated - not just stored, echo'd, or sent to database.

It largely seems to effect comments - people think they are like plain text Strings, but PHP does process them, and will break when it hits something far outside the range of ASCII or a Western ISO.

If you ever get interested in something as delicate as allowing usernames/input in multibyte encodings you'll also find that PHP's PCRE functions (good for matching against native alphanumeric Unicode ranges with preg_match() etc.) can be fickle across many servers. Many PHP installations will not support the Unicode switch used in the regex. Another little compatibility problem that sinks supporting UTF more widely...

PHP6 - wish it were here already ;).

Harry Fueck's UTF library is very good where the presence of mbstring is not a certainty and you need to manipulate strings.

Posted: Tue Jan 09, 2007 4:06 pm
by Ambush Commander
Don't allow multibyte strings in usernames. 1. Users don't expect this sort of functionality and 2. It opens you to username doppelgangers (think phishing)

Posted: Tue Jan 09, 2007 5:42 pm
by georgeoc
Thanks for the comments guys.

Before I get too obsessed with this and really go down the UTF-8 path, I'd like an honest answer. Bearing in mind my project (messaging hub, described above), the fact that it should be as portable as possible (PHP 4 or 5, no special extensions required, MySQL, Postgres, Oracle, MSSQL, etc), and the fact that it should be as universally useful as possible (i.e. to people using any number of languages and alphabets), what would you do? Carry on with AdoDB with the 'fudges' such as Ambush Commander suggests, or use another library? Or should I just forget Unicode and cut off a whole group of potential users of my app?

Posted: Tue Jan 09, 2007 5:45 pm
by Ambush Commander
Just stuff it in a binary column. Problem solved. No collation, of course, but most users don't need that.

Posted: Tue Jan 09, 2007 5:49 pm
by Maugrim_The_Reaper
The main aim of being internationalised is to let people speaking and writing another language use the application. This usually focuses on two areas: input and ouput (with the PHP code being ASCII usually - not a huge deal). For input, keep the basics towards ASCII for registration etc. User contributed text, assuming it does not require PHP manipulation, will rely mainly on having a Unicode safe storage system - i.e. MySQL 4.1+ with the fudges (the fudging only takes place at installation - setting default collation/encoding of tables and text/varchar fields). Output is a little harder - any templates should be UTF-8 encoded (check your editor) and a UTF-8 encoding passed in the HTML head.