[SOLVED] This query mangles text

Questions about the MySQL, PostgreSQL, and most other databases, as well as using it with PHP can be asked here.

Moderator: General Moderators

Post Reply
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

[SOLVED] This query mangles text

Post by Ambush Commander »

Code: Select all

(mysql): INSERT INTO `comments` ( `id`, `ipaddress`, `timestamp`, `username`, `contents`, `parent` ) VALUES (49,'127.0.0.1',1134099673,'Ambush Commander','华夏',NULL)
I don't understand... the data goes in, but the contents just get badly mangled. Why? Why?

Code: Select all

CREATE TABLE `comments` (
  `id` int(10) unsigned NOT NULL default '0',
  `ipaddress` varchar(15) collate utf8_unicode_ci default NULL,
  `timestamp` int(10) unsigned NOT NULL default '0',
  `username` varchar(30) collate utf8_unicode_ci default NULL,
  `parent` int(10) unsigned default NULL,
  `contents` text collate utf8_unicode_ci NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
It comes out like...

Code: Select all

�?��?
Using AdoDB. UTF-8. Don't want to entityize my chinese.
Last edited by Ambush Commander on Fri Dec 09, 2005 12:24 pm, edited 1 time in total.
User avatar
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W
Contact:

Post by pickle »

Not too sure - but is it possible that the DB is storing the data properly, but when you view it, it gets mangled? Could the problem be with what you're using to view the results rather than MySQL?
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

I tried viewing it via PHPMyAdmin and it looked like the data was irreparably damaged. It seems like certain octets are getting replaced by ?. Not all chinese characters get mangled, and most of the common multibyte european letters survive unscathed.

Is
Roja
Tutorials Group
Posts: 2692
Joined: Sun Jan 04, 2004 10:30 pm

Post by Roja »

I'd ask on the adodb support forum.
User avatar
RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

Have you tried viewing the results in different browsers. I have had an experience like this where IE displays special characters fine but FF chokes on them.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

It does work in IE!

But why?
User avatar
php3ch0
Forum Contributor
Posts: 212
Joined: Sun Nov 13, 2005 7:35 am
Location: Folkestone, Kent, UK

Post by php3ch0 »

What sort of data are you trying to put in?

Have you selected the correct data type?
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

UTF-8 string. Intercepting the post data and echoing it onto a UTF-8 web page seems to work.

Edit - It definitely is not a problem with AdoDB

Code: Select all

header('Content-type: text/html; charset=utf-8');

$link = mysql_connect('localhost', 'php', '****')
   or die('Could not connect: ' . mysql_error());
mysql_select_db('wikistatus') or die('Could not select database');

$string = chr(hexdec('E5')).chr(hexdec('8D')).chr(hexdec('8E')); //unicode bytes
$data = mysql_real_escape_string($string);

//these are the same
echo $string . '<br />';
echo $data . '<br />';

//This doesn't fail
mysql_query("INSERT `comments` (`id`,`contents`) VALUES ('435','$data')") or die('Query failed: ' . mysql_error());

//But viewing the result row, the 8D octet is turned into a ? while the E5
//and 8E ones are preserved. 
//Interestingly, 8D is not a printable character in most screens
Edit 2 - I can't find 8D in any of the extended character sets. How does Internet Explorer know what I'm talking about? Plus... I really, really don't want to resort to using Binary for my text fields... Hrmm..
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Changing my query to:

Code: Select all

mysql_query("INSERT `comments` (`id`,`contents`) VALUES ('435',_utf8'$data')") or die('Query failed: ' . mysql_error());
Seems to work. But I'm fairly certain this will not work with MySQL 4.0. Still struggling!
User avatar
RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

Dude, I wish I knew what IE was seeing that other browsers aren't. I do know that sometimes character like accented vowels or umlauts show up as little diamonds with a question mark inside them in FF. I don't know if it is a FF limitation, security issue or if the developers just skipped over that part.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Internet Explorer is guessing. That's it.

Still pulling my hair out! This is probably something people have experienced before... MySQL 4.1 only code is not an option for me unfortunantely.

Edit
A common strategy for applications built using MySQL 4.0 that needed to handle data in multi-byte character encodings (such as UTF-8) that were not supported natively was to simply store the data in VARCHAR fields (or TEXT, CHAR, etc). The application would either ignore the fact that MySQL would simply sort the data incorrectly for strings that weren't really in the character set that the server thought they were, or do the sorting in the application. That strategy worked fine, you just had to handle all of the encoding issues in your application, and you also lost the ability to take full advantage of MySQL's full-text searching.
Vrm. So why doesn't this particular character insert correctly?

Edit 2 - It just gets screwier and screwier. MySQL 4.0 can handle it fine without the _utf8, but MySQL 4.1 can't!

Edit 3 - And then magically, via PHPMyAdmin,

Code: Select all

INSERT `comments` (
`id` ,
`contents`
)
VALUES (
'100', '&#21326;'
)
works.

Edit 4 It appears PHP is opening a connection to MySQL via Latin1 while PHPMyAdmin is using Unicode. It is a ADOdb problem!
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Fixed for 4.1

The problem was I was sending data over a latin1 wire when it actually was utf-8. Fixed by executing SET NAMES utf8 right after opening the connection (thus, the _utf8 solution did work, but this is a much stabler way of doing it). Still trying to find the MySQL 4.0 equivalent.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Final Solution

A little known change of behavior that occurred when MySQL 4.0 went 4.1 was how MySQL handles characters that are not in the character set we are operating under.

In 4.0, MySQL was very, very tolerant and would let everything save murder get past it.

In 4.1, MySQL was a bit more strict. If you said the connection was latin1, and you sent Unicode data that had octets that weren't valid latin1, MySQL would translate them into ?, breaking the Unicode word.

The trick, then, is to change the charset to UTF-8 only when running MySQL 4.1, as this directive 1) doesn't exist and 2) isn't necessary in MySQL 4.0.

The PHP manual trumps the MySQL manual in this aspect: it tells you exactly what has changed exactly on that page, while MySQL you have to do some hunting around: changes aren't that obvious.
Post Reply