
ISO-8859-1 to UTF-8 Headache!

Posted: Fri Nov 14, 2008 8:53 am
by icesolid
The company I work for wanted a redesign of the company website. Using this as an opportunity to bring the site up to current standards, I chose to update all of the design elements to XHTML Transitional and the character encoding to UTF-8. I figured, why not? Everyone else is doing it.

Wrong! The XHTML Transitional part of the update was nice: a lot more cross-browser support, and I like it! But the character encoding was a problem.

Everything works fine except for special characters. Word characters that were pasted into a textarea and submitted to the database while I was using ISO-8859-1 come out all jumbled up now that they are output back to that textarea using UTF-8.

Those squiggly double and single quotes and commas from Word that were submitted in the ISO-8859-1 encoding are all messed up.

So basically my question is, how can I convert all of that data in my database to the new UTF-8 encoding I am using?

Also, some other scripts like fpdf render my new UTF-8 database output incorrectly. I think fpdf is trying to use ISO-8859-1, or PHP is telling it to use ISO-8859-1?

Re: ISO-8859-1 to UTF-8 Headache!

Posted: Fri Nov 14, 2008 9:01 am
by Eran
If you don't have Unicode characters in your database, then converting could be as simple as changing every column and table collation in your database to utf8 (utf8_unicode_ci / utf8_general_ci - read more here).
If that gives you problems, you could try writing a script that copies the data over one table at a time (from an old ISO-8859-1 table into a new UTF-8 table with the same structure), using PHP's built-in functions for character conversion (iconv and mb_convert_encoding).
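
For the script route, the per-value conversion could look roughly like the sketch below. It is only an illustration: legacy_to_utf8 is a made-up helper name, and it assumes the old rows actually hold Windows-1252 bytes (which is what browsers typically submit for pages declared as ISO-8859-1, and where Word's curly quotes live). Values that already validate as UTF-8 are left alone so that running the script twice does not double-encode them; that check cannot tell apart a legacy value that happens to be valid UTF-8, so treat it as a heuristic.

```php
<?php
// Hypothetical helper (not from this thread): convert one legacy value to UTF-8.
// Assumption: the stored bytes are Windows-1252 (CP1252), a superset of the
// printable ISO-8859-1 range that also holds Word's curly quotes at 0x91-0x94.
function legacy_to_utf8(string $value): string
{
    // The /u modifier makes preg_match() fail on invalid UTF-8, so a result
    // of 1 means the value is already valid UTF-8 (plain ASCII included).
    if (preg_match('//u', $value) === 1) {
        return $value;
    }

    $converted = iconv('CP1252', 'UTF-8', $value);

    // iconv() returns false on bytes it cannot map; keep the original then,
    // so nothing is silently lost and the row can be inspected by hand.
    return $converted === false ? $value : $converted;
}

// 0x93 and 0x94 are the CP1252 bytes for Word's opening and closing double quotes.
echo legacy_to_utf8("\x93quoted\x94"), "\n";
```

mb_convert_encoding($value, 'UTF-8', 'Windows-1252') would be the mbstring counterpart of the iconv() call, if that extension is what you have available.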

Be sure to back up your database before you try anything.

Re: ISO-8859-1 to UTF-8 Headache!

Posted: Fri Nov 14, 2008 9:15 am
by icesolid
How come my MySQL default collation is always latin1_swedish_ci, but my MySQL connection collation is utf8?

Does it matter if each of my fields is latin1_swedish_ci?

Re: ISO-8859-1 to UTF-8 Headache!

Posted: Fri Nov 14, 2008 9:17 am
by Eran
Yes, they should all use the same utf8 collation you choose for your table. The column collation is the one that matters - the table collation is merely the default collation for new columns.
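
If it helps, MySQL can change the table default and every character column in one statement with ALTER TABLE ... CONVERT TO CHARACTER SET. Below is a minimal sketch that only builds those statements as strings so you can review them before running anything. The helper name build_convert_sql and the table names are invented for the example, and it assumes the existing data really is stored as latin1 rather than mislabelled.

```php
<?php
// Hypothetical helper: build the statement that converts one table.
// The charset and collation defaults are the ones mentioned in this thread.
function build_convert_sql(
    string $table,
    string $charset = 'utf8',
    string $collation = 'utf8_general_ci'
): string {
    // Double any backtick so the table name stays a single quoted identifier.
    $quoted = '`' . str_replace('`', '``', $table) . '`';

    return "ALTER TABLE $quoted CONVERT TO CHARACTER SET $charset COLLATE $collation";
}

// Table names are made up for illustration; substitute your own.
foreach (['articles', 'comments'] as $table) {
    echo build_convert_sql($table), ";\n";
}
```

Printing the statements first, rather than executing them straight away, leaves room to try them on a copy of the database.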

Re: ISO-8859-1 to UTF-8 Headache!

Posted: Fri Nov 14, 2008 9:22 am
by icesolid
I am just wondering why MySQL is choosing latin1 as my default every time I create a new table, when my default is set to utf8?

If I change the columns to utf8 from latin1, will the already existing data convert to utf8? Or will I still have to convert the existing data?

Re: ISO-8859-1 to UTF-8 Headache!

Posted: Fri Nov 14, 2008 10:09 am
by Eran
icesolid wrote: I am just wondering why MySQL is choosing latin1 as my default every time I create a new table, when my default is set to utf8?
Each database has its own collation as well, which is the default for new tables in that database. It might be different from the default for the database server.
icesolid wrote: If I change the columns to utf8 from latin1, will the already existing data convert to utf8? Or will I still have to convert the existing data?
It will be converted successfully if no special (non-ASCII) characters are present. Your initial post suggests that this isn't the case. You probably have HTML entities encoded in ISO-8859-1, which you will have to convert to UTF-8. Something like:

Code: Select all

 
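// $field is a value read from a column stored with ISO-8859-1 encoding.
$field = html_entity_decode($field, ENT_COMPAT, 'ISO-8859-1');
// Convert the raw bytes to UTF-8 first: htmlentities() rejects input that is not valid UTF-8.
$field = iconv('ISO-8859-1', 'UTF-8', $field);
// The final false avoids double-encoding entities that were left undecoded above.
$field = htmlentities($field, ENT_COMPAT, 'UTF-8', false);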
 
Remember you have to switch the encoding of the connection when working with databases of different collations. Use 'SET CHARACTER SET ...' and 'SET NAMES ...' to control it.

Re: ISO-8859-1 to UTF-8 Headache!

Posted: Fri Nov 14, 2008 11:56 am
by icesolid
Thank you very much for your help.

My last and final question: is it really important to switch to UTF-8? The only people who will ever access this website are English-speaking people from America, so ISO-8859-1 should be fine, right?

My biggest problem through all of this was workers copying text from their Word documents, pasting it into a textarea, and submitting it to the database. In ISO-8859-1 there are no problems with Word's double quotes and such, but using UTF-8 the quotes come out as garbled characters.

In the future will everything be in UTF-8 encoding?

Re: ISO-8859-1 to UTF-8 Headache!

Posted: Fri Nov 14, 2008 2:56 pm
by Eran
Generally speaking, if you don't need internationalization then you don't need UTF-8 for storing text. Also, the collation shouldn't matter for characters like double quotes, if they are stored as such. Since you say that people are copying and pasting from Word, it's possible those are not the normal double quotes but the curly double quotes (“ as opposed to "). If this is your only problem, simply replace those in any incoming input before it goes into the database.
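
A rough sketch of that replacement step, assuming the incoming text has already arrived as UTF-8. The function name straighten_quotes is made up for the example, and the list only covers the four common curly quotes, not every character Word can produce.

```php
<?php
// Hypothetical helper: swap Word's curly quotes for plain ASCII quotes.
// Assumption: $text is UTF-8. The "\u{...}" escapes need PHP 7 or later.
function straighten_quotes(string $text): string
{
    return strtr($text, [
        "\u{201C}" => '"',  // left double quotation mark
        "\u{201D}" => '"',  // right double quotation mark
        "\u{2018}" => "'",  // left single quotation mark
        "\u{2019}" => "'",  // right single quotation mark / apostrophe
    ]);
}

echo straighten_quotes("\u{201C}It\u{2019}s fine\u{201D}"), "\n";
```

strtr() with an array does all of the replacements in a single pass, so the order of the entries does not matter here.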

Re: ISO-8859-1 to UTF-8 Headache!

Posted: Fri Nov 14, 2008 3:00 pm
by icesolid
UTF-8 should be able to handle those MS Word curly quotes though?

My problem is getting my whole site set up to use UTF-8. I had to make php.ini changes and my.ini changes, and I think possibly Apache setting changes too. They were all set to use ISO-8859-1 and were overriding my HTML doc charset declaration.
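
For anyone hitting the same override: a charset sent in the HTTP Content-Type header takes precedence over the charset declared in the HTML document itself, so one option is to send UTF-8 explicitly from PHP. This is a minimal sketch, assuming it runs at the top of the script before any output; it does not touch the my.ini or Apache side.

```php
<?php
// Same effect as setting default_charset in php.ini, but for this request only.
ini_set('default_charset', 'UTF-8');

// Send an explicit header as well. header() only works before any output,
// so check headers_sent() first instead of triggering a warning.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
```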

Re: ISO-8859-1 to UTF-8 Headache!

Posted: Fri Nov 14, 2008 3:05 pm
by Eran
UTF-8 should be able to handle those characters. You should try and see if it works (don't forget to back up your database first).