
UTF-8 MySQL tables, HTML output and more...

Posted: Tue Aug 17, 2010 10:17 am
by alex.barylski
I have a very confusing issue that I hope someone with more experience dealing with UTF-8 can clear up for me.

I have some legacy data that was entered using Access. I exported this data into MySQL tables; since I never touched the charset of the MySQL tables, I assume they remained latin1.

Some characters entered in Access were special characters, such as the ° or ® and others.

When I query this data from the MySQL table, all is well, but I need to convert the result into JSON, and the default JSON methods (json_encode/json_decode) deal with UTF-8 only. So before passing the array to these methods I would iterate over it and apply htmlentities(), or more recently utf8_encode(), which seemed to satisfy the JSON methods. Without this step, json_encode() would choke on arrays that had fields with funny characters.
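A minimal sketch of that workaround, assuming each row comes back from MySQL as latin1 (ISO-8859-1) strings (the field name is just illustrative):

```php
<?php
// Convert every text field of a latin1 row to UTF-8 before json_encode().
function latin1RowToUtf8(array $row)
{
    foreach ($row as $key => $value) {
        if (is_string($value)) {
            // utf8_encode() transcodes ISO-8859-1 to UTF-8
            $row[$key] = utf8_encode($value);
        }
    }
    return $row;
}

// 0xB0 is the latin1 byte for the degree sign
$row = array('description' => "90\xB0 angle");
echo json_encode(latin1RowToUtf8($row)); // {"description":"90\u00b0 angle"}
```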

Then I got to thinking: I could probably avoid manually converting the charset each time I return an array by simply changing the charset of the fields from latin1 to UTF-8. So I exported the old data, changed the charset to UTF-8 on all required fields, and re-imported the data into the new table.

Problem is, I still need to iterate over the array and utf8_encode() each text field before passing it to json_encode() -- so I am wondering whether I need a script (SQL or PHP) that will convert all the existing characters in the table to UTF-8 as well? Is it not enough to convert just the field definitions? I assumed that would convert the existing characters at import.
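For what it's worth, changing the declared charset of a column tells MySQL how to interpret bytes; whether a dump-and-reload actually transcodes the data depends on how the dump was taken. For a one-shot, in-place conversion of an existing table, MySQL can do the transcoding itself (table name taken from the schema posted later in this thread):

```sql
-- Converts both the column definitions and the stored data.
-- (ALTER TABLE ... DEFAULT CHARACTER SET, by contrast, only
-- affects columns added later.)
ALTER TABLE rpi_sequences
  CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
```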

Any ideas? Input? Insight?

Cheers,
Alex

Re: UTF-8 MySQL tables, HTML output and more...

Posted: Tue Aug 17, 2010 12:39 pm
by Weirdan
What is your current schema (SHOW CREATE TABLE), what do you have in the file for those special characters (hex-encoded), and how did you reimport the file (the exact command line you used)?
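A quick way to get that hex view straight from MySQL (table and column names taken from the schema posted below in this thread):

```sql
-- The raw bytes reveal the actual encoding of the degree sign:
--   0xB0        if it is stored as latin1,
--   0xC2B0      if it is stored as UTF-8,
--   0xC382C2B0  if it has been double-encoded.
SELECT id_primary, HEX(description)
FROM rpi_sequences
LIMIT 5;
```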

Re: UTF-8 MySQL tables, HTML output and more...

Posted: Wed Aug 18, 2010 8:04 am
by alex.barylski
Thanks for the quick reply:


Create Table


CREATE TABLE `rpi_sequences` (
  `id_primary` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `id_previous` int(11) unsigned NOT NULL,
  `id_task` int(11) unsigned NOT NULL,
  `id_station` int(11) unsigned NOT NULL,
  `date_created` int(11) unsigned NOT NULL,
  `time_bench` float unsigned NOT NULL,
  `time_machine` float unsigned NOT NULL,
  `description` text NOT NULL,
  `description_cache` text NOT NULL,
  PRIMARY KEY (`id_primary`),
  KEY `id_station` (`id_station`),
  FULLTEXT KEY `description_cache` (`description_cache`)
) ENGINE=MyISAM AUTO_INCREMENT=12375 DEFAULT CHARSET=utf8

How I imported, I couldn't tell you exactly: I dumped the contents using SQLyog, saved it as a SQL dump, and re-imported it.

When I view the data in SQLyog or dump it to the browser, the symbol clearly comes across as a degree symbol (Alt+0176). I have tried replacing the character with the UTF-8 escape (\u00B0), but that did nothing except come back as those literal characters, not the degree character that is needed.

Am I missing something in the conversion process?

Cheers,
Alex

Re: UTF-8 MySQL tables, HTML output and more...

Posted: Wed Aug 18, 2010 8:33 am
by alex.barylski
I found an article on entering UTF-8 characters, so I added a registry key and restarted; now, when I enter Unicode Alt+B0, I can at least type what I thought was a UTF-8 degree symbol, but it still causes errors with json_encode(), which expects UTF-8 characters only. Ugh!!! :banghead:

EDIT | I'm about to give up on this issue and just convert everything with htmlentities(). I return the results using this in my SELECT:


CONVERT(description_cache USING utf8)

And json_encode() still chokes on the bloody degree symbol. I'm at a loss for ideas now.
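The choking itself is easy to reproduce: the degree sign stored as latin1 is the single byte 0xB0, which is not a valid UTF-8 sequence, so json_encode() rejects it no matter what the column's declared charset says. A small demonstration:

```php
<?php
$latin1 = "90\xB0";     // degree sign as a latin1 byte -- invalid UTF-8
$utf8   = "90\xC2\xB0"; // the same text properly encoded as UTF-8

var_dump(json_encode($latin1)); // false on modern PHP (older versions silently mangled it)
var_dump(json_encode($utf8));   // "90\u00b0" -- valid
```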

Cheers,
Alex

Re: UTF-8 MySQL tables, HTML output and more...

Posted: Wed Aug 18, 2010 8:42 am
by Eran
I'm not sure about the export process you described, but when I previously faced a similar problem, I connected to the database using a latin1 connection (forcing it with SET NAMES and SET CHARACTER SET), read the data into an array, changed the connection charset to utf8, and reinserted it. That seems to have solved the issue.
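A rough sketch of that approach using mysqli (connection details and the use of UPDATE rather than a literal re-insert are my assumptions): read the rows while the connection charset matches the bytes actually stored, then switch the connection to utf8 so MySQL transcodes on the way back in.

```php
<?php
// Hypothetical credentials -- adjust for the real server.
$db = new mysqli('localhost', 'user', 'pass', 'rpi');

// 1. Read with a latin1 connection so the stored bytes arrive as-is.
$db->set_charset('latin1');
$result = $db->query('SELECT id_primary, description FROM rpi_sequences');
$rows   = $result->fetch_all(MYSQLI_ASSOC);

// 2. Switch the connection to utf8 and write the rows back;
//    MySQL now converts latin1 -> utf8 on the way in.
$db->set_charset('utf8');
$stmt = $db->prepare('UPDATE rpi_sequences SET description = ? WHERE id_primary = ?');
foreach ($rows as $row) {
    $stmt->bind_param('si', $row['description'], $row['id_primary']);
    $stmt->execute();
}
```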