Page 1 of 1

[SOLVED] This query mangles text

Posted: Thu Dec 08, 2005 9:43 pm
by Ambush Commander

Code: Select all

(mysql): INSERT INTO `comments` ( `id`, `ipaddress`, `timestamp`, `username`, `contents`, `parent` ) VALUES (49,'127.0.0.1',1134099673,'Ambush Commander','华夏',NULL)
I don't understand... the data goes in, but the contents just get badly mangled. Why? Why?

Code: Select all

CREATE TABLE `comments` (
  `id` int(10) unsigned NOT NULL default '0',
  `ipaddress` varchar(15) collate utf8_unicode_ci default NULL,
  `timestamp` int(10) unsigned NOT NULL default '0',
  `username` varchar(30) collate utf8_unicode_ci default NULL,
  `parent` int(10) unsigned default NULL,
  `contents` text collate utf8_unicode_ci NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
It comes out like...

Code: Select all

�?��?
Using AdoDB. UTF-8. Don't want to entityize my chinese.

Posted: Fri Dec 09, 2005 9:59 am
by pickle
Not too sure - but is it possible that the DB is storing the data properly, but when you view it, it gets mangled? Could the problem be with what you're using to view the results rather than MySQL?

Posted: Fri Dec 09, 2005 10:04 am
by Ambush Commander
I tried viewing it via PHPMyAdmin and it looked like the data was irreparably damaged. It seems like certain octets are getting replaced by ?. Not all chinese characters get mangled, and most of the common multibyte european letters survive unscathed.

Is

Posted: Fri Dec 09, 2005 10:07 am
by Roja
I'd ask on the adodb support forum.

Posted: Fri Dec 09, 2005 10:11 am
by RobertGonzalez
Have you tried viewing the results in different browsers. I have had an experience like this where IE displays special characters fine but FF chokes on them.

Posted: Fri Dec 09, 2005 10:14 am
by Ambush Commander
It does work in IE!

But why?

Posted: Fri Dec 09, 2005 10:15 am
by php3ch0
What sort of data are you trying to put in?

Have you selected the correct data type?

Posted: Fri Dec 09, 2005 10:16 am
by Ambush Commander
UTF-8 string. Intercepting the post data and echoing it onto a UTF-8 web page seems to work.

Edit - It definitely is not a problem with AdoDB

Code: Select all

header('Content-type: text/html; charset=utf-8');

$link = mysql_connect('localhost', 'php', '****')
   or die('Could not connect: ' . mysql_error());
mysql_select_db('wikistatus') or die('Could not select database');

$string = chr(hexdec('E5')).chr(hexdec('8D')).chr(hexdec('8E')); //unicode bytes
$data = mysql_real_escape_string($string);

//these are the same
echo $string . '<br />';
echo $data . '<br />';

//This doesn't fail
mysql_query("INSERT `comments` (`id`,`contents`) VALUES ('435','$data')") or die('Query failed: ' . mysql_error());

//But viewing the result row, the 8D octet is turned into a ? while the E5
//and 8E ones are preserved. 
//Interestingly, 8D is not a printable character in most screens
Edit 2 - I can't find 8D in any of the extended character sets. How does Internet Explorer know what I'm talking about? Plus... I really, really don't want to resort to using Binary for my text fields... Hrmm..

Posted: Fri Dec 09, 2005 10:34 am
by Ambush Commander
Changing my query to:

Code: Select all

mysql_query("INSERT `comments` (`id`,`contents`) VALUES ('435',_utf8'$data')") or die('Query failed: ' . mysql_error());
Seems to work. But I'm fairly certain this will not work with MySQL 4.0. Still struggling!

Posted: Fri Dec 09, 2005 10:36 am
by RobertGonzalez
Dude, I wish I knew what IE was seeing that other browsers aren't. I do know that sometimes character like accented vowels or umlauts show up as little diamonds with a question mark inside them in FF. I don't know if it is a FF limitation, security issue or if the developers just skipped over that part.

Posted: Fri Dec 09, 2005 10:38 am
by Ambush Commander
Internet Explorer is guessing. That's it.

Still pulling my hair out! This is probably something people have experienced before... MySQL 4.1 only code is not an option for me unfortunantely.

Edit
A common strategy for applications built using MySQL 4.0 that needed to handle data in multi-byte character encodings (such as UTF-8) that were not supported natively was to simply store the data in VARCHAR fields (or TEXT, CHAR, etc). The application would either ignore the fact that MySQL would simply sort the data incorrectly for strings that weren't really in the character set that the server thought they were, or do the sorting in the application. That strategy worked fine, you just had to handle all of the encoding issues in your application, and you also lost the ability to take full advantage of MySQL's full-text searching.
Vrm. So why doesn't this particular character insert correctly?

Edit 2 - It just gets screwier and screwier. MySQL 4.0 can handle it fine without the _utf8, but MySQL 4.1 can't!

Edit 3 - And then magically, via PHPMyAdmin,

Code: Select all

INSERT `comments` (
`id` ,
`contents`
)
VALUES (
'100', '&#21326;'
)
works.

Edit 4 It appears PHP is opening a connection to MySQL via Latin1 while PHPMyAdmin is using Unicode. It is a ADOdb problem!

Posted: Fri Dec 09, 2005 11:53 am
by Ambush Commander
Fixed for 4.1

The problem was I was sending data over a latin1 wire when it actually was utf-8. Fixed by executing SET NAMES utf8 right after opening the connection (thus, the _utf8 solution did work, but this is a much stabler way of doing it). Still trying to find the MySQL 4.0 equivalent.

Posted: Fri Dec 09, 2005 12:24 pm
by Ambush Commander
Final Solution

A little known change of behavior that occurred when MySQL 4.0 went 4.1 was how MySQL handles characters that are not in the character set we are operating under.

In 4.0, MySQL was very, very tolerant and would let everything save murder get past it.

In 4.1, MySQL was a bit more strict. If you said the connection was latin1, and you sent Unicode data that had octets that weren't valid latin1, MySQL would translate them into ?, breaking the Unicode word.

The trick, then, is to change the charset to UTF-8 only when running MySQL 4.1, as this directive 1) doesn't exist and 2) isn't necessary in MySQL 4.0.

The PHP manual trumps the MySQL manual in this aspect: it tells you exactly what has changed exactly on that page, while MySQL you have to do some hunting around: changes aren't that obvious.