ISO-8859-1 to UTF-8 Headache!

icesolid
Forum Regular
Posts: 502
Joined: Mon May 06, 2002 9:36 pm
Location: Buffalo, NY

ISO-8859-1 to UTF-8 Headache!

Post by icesolid »

The company I work for wanted a redesign of its website. I used the opportunity to bring the site up to current standards: I chose to move all of the design elements to XHTML Transitional and the character encoding to UTF-8. I figured, why not? Everyone else is doing it.

Wrong! The XHTML Transitional part of the update was nice (much better cross-browser support, I like it!), but the character encoding was a problem.

Everything works fine except for special characters. Word characters that were pasted into a textarea and submitted to the database while I was using ISO-8859-1 come out all jumbled up now that they are output back to that textarea using UTF-8.

Those squiggly double and single quotes and commas from Word that were submitted in the ISO-8859-1 encoding are all messed up.

So basically my question is, how can I convert all of that data in my database to the new UTF-8 encoding I am using?

Also, some other scripts, like fpdf, render my new UTF-8 database output incorrectly. I think fpdf is trying to use ISO-8859-1, or PHP is telling it to use ISO-8859-1?
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: ISO-8859-1 to UTF-8 Headache!

Post by Eran »

If you don't have Unicode characters in your database, then converting could be as simple as changing every column and table collation in your database to utf8 (utf8_unicode_ci / utf8_general_ci - read more here).
If that gives you problems, you could try writing a script that converts the data one pair of tables at a time (an old ISO-8859-1 table and a new UTF-8 table with the same structure), using PHP's built-in functions for character conversion (iconv and mb_convert_encoding).

Be sure to back up your database before you try anything.
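
As a rough illustration of what a conversion function such as iconv or mb_convert_encoding does to a single stored value, here is a minimal Python sketch: decode the bytes with the old encoding, then re-encode them as UTF-8. The function name legacy_to_utf8 is hypothetical, and defaulting to Windows-1252 (cp1252) instead of strict ISO-8859-1 is an assumption: Word's curly quotes have no code point in ISO-8859-1, so data labelled ISO-8859-1 that contains them is very likely Windows-1252 in practice.

```python
def legacy_to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode bytes stored in a legacy single-byte encoding as UTF-8.

    source_encoding is an assumption about the old data; pass "latin-1"
    if the column really holds strict ISO-8859-1.
    """
    text = raw.decode(source_encoding)  # bytes -> Unicode text
    return text.encode("utf-8")         # Unicode text -> UTF-8 bytes


# 0x93 and 0x94 are the Windows-1252 bytes for Word's curly double quotes.
old_value = b"\x93Hello\x94"
new_value = legacy_to_utf8(old_value)
print(new_value)  # b'\xe2\x80\x9cHello\xe2\x80\x9d'
```

A real migration would run something like this over every text column, writing into the new table and leaving the old table untouched until the result has been checked.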
icesolid
Forum Regular
Posts: 502
Joined: Mon May 06, 2002 9:36 pm
Location: Buffalo, NY

Re: ISO-8859-1 to UTF-8 Headache!

Post by icesolid »

How come my MySQL default collation is latin1_swedish_ci everywhere, but my MySQL connection collation is utf8?

Does it matter if each of my fields is latin1_swedish_ci?
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: ISO-8859-1 to UTF-8 Headache!

Post by Eran »

Yes, they should all use the same utf8 collation you choose for your table. The column collation is the one that matters - the table collation is merely the default that new columns pick up.
icesolid
Forum Regular
Posts: 502
Joined: Mon May 06, 2002 9:36 pm
Location: Buffalo, NY

Re: ISO-8859-1 to UTF-8 Headache!

Post by icesolid »

I am just wondering why MySQL is choosing latin1 as my default every time I create a new table, when my default is set to utf8?

If I change the columns to utf8 from latin1, will the already existing data convert to utf8? Or will I still have to convert the existing data?
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: ISO-8859-1 to UTF-8 Headache!

Post by Eran »

"I am just wondering why MySQL is choosing latin1 as my default every time I create a new table, when my default is set to utf8?"
Each database has its own collation as well, which is the default for new tables in that database. It might be different from the default for the database server.

"If I change the columns to utf8 from latin1, will the already existing data convert to utf8? Or will I still have to convert the existing data?"
It will be converted successfully if no special (non-ASCII) characters are present. Your initial post suggests that this isn't the case. You probably have HTML entities encoded in ISO-8859-1, which you will have to convert to UTF-8. Something like:


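// $field is a value read from a column that holds ISO-8859-1 text
$field = html_entity_decode($field, ENT_COMPAT, 'ISO-8859-1'); // entities -> ISO-8859-1 characters
$field = mb_convert_encoding($field, 'UTF-8', 'ISO-8859-1');   // re-encode the string as UTF-8
$field = htmlentities($field, ENT_COMPAT, 'UTF-8');            // escape again, now from valid UTF-8 input
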
Remember you have to switch the encoding of the connection when working with databases of different collations. Use 'SET CHARACTER SET ...' and 'SET NAMES ...' to control it.
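
To show why the connection encoding matters, here is a small Python sketch (not MySQL itself) of what happens when UTF-8 bytes are read as if they were a single-byte encoding, which is the usual source of jumbled quotes. The function names misread and repair are hypothetical, and Windows-1252 as the wrong encoding is an assumption made for illustration.

```python
def misread(text: str, sent_as: str = "utf-8", read_as: str = "cp1252") -> str:
    """Encode text one way and decode it another, as a mismatched connection would."""
    return text.encode(sent_as).decode(read_as)


def repair(garbled: str, sent_as: str = "utf-8", read_as: str = "cp1252") -> str:
    """Undo misread() by reversing the two steps."""
    return garbled.encode(read_as).decode(sent_as)


original = "It\u2019s"       # "It's" with Word's curly apostrophe
garbled = misread(original)  # the apostrophe turns into three characters
print(garbled)
print(repair(garbled) == original)  # True

# Note: Python's cp1252 codec leaves a few bytes (for example 0x9D) unmapped,
# so misread() raises UnicodeDecodeError for characters whose UTF-8 form
# contains them, such as the right curly double quote.
```

If the stored text looks like the garbled form, the data itself may be fine and only the connection encoding may be wrong, so it is worth checking that before rewriting any rows.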
icesolid
Forum Regular
Posts: 502
Joined: Mon May 06, 2002 9:36 pm
Location: Buffalo, NY

Re: ISO-8859-1 to UTF-8 Headache!

Post by icesolid »

Thank you very much for your help.

My last and final question: is it really important to switch to UTF-8? The only people who will ever access this website are English-speaking people from America, so ISO-8859-1 should be fine, right?

My biggest problem through all of this was workers copying text from their Word documents, pasting it into a textarea, and submitting it to the database. In ISO-8859-1 there are no problems with Word's double quotes and such, but using UTF-8 the quotes come out as garbled characters.

In the future will everything be in UTF-8 encoding?
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: ISO-8859-1 to UTF-8 Headache!

Post by Eran »

Generally speaking, if you don't need internationalization, then you don't need UTF-8 for storing text. Also, the collation shouldn't matter for characters like double quotes, provided they are stored as plain double quotes. Since you say that people are copying and pasting from Word, it's possible those are not the normal double quotes but the curly ones (“ as opposed to "). If this is your only problem, simply replace those in any incoming input before it goes into the database.
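
A minimal Python sketch of that replacement step, assuming the input has already been decoded to Unicode text. The name normalize_word_punctuation and the exact list of characters are illustrative assumptions; extend the table to whatever actually shows up in your data.

```python
# Map Word's typographic punctuation to plain ASCII equivalents.
WORD_PUNCTUATION = {
    "\u201c": '"',  # left curly double quote
    "\u201d": '"',  # right curly double quote
    "\u2018": "'",  # left curly single quote
    "\u2019": "'",  # right curly single quote / apostrophe
    "\u201a": ",",  # low single quote, which looks like a comma
}
_TABLE = str.maketrans(WORD_PUNCTUATION)


def normalize_word_punctuation(text: str) -> str:
    """Return text with Word's curly punctuation replaced by ASCII characters."""
    return text.translate(_TABLE)


print(normalize_word_punctuation("\u201cHi\u201d, it\u2019s me"))  # "Hi", it's me
```

The trade-off is that the typographic quotes are lost for good, so this only makes sense if plain ASCII punctuation is acceptable on the site.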
icesolid
Forum Regular
Posts: 502
Joined: Mon May 06, 2002 9:36 pm
Location: Buffalo, NY

Re: ISO-8859-1 to UTF-8 Headache!

Post by icesolid »

UTF-8 should be able to handle those MS Word curly quotes though?

My problem is getting my whole site set up to use UTF-8. I had to make php.ini changes and my.ini changes, and possibly Apache setting changes too. They were all set to use ISO-8859-1 and were overriding the charset declaration in my HTML documents.
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: ISO-8859-1 to UTF-8 Headache!

Post by Eran »

UTF-8 should be able to handle those characters. Try it and see if it works (don't forget to back up your database first).
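
One quick way to check this outside the database is a short Python sketch that tests which encodings can represent the curly quotes at all. The helper name can_encode is hypothetical.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = "\u201cquoted\u201d"  # a word wrapped in Word's curly double quotes
for encoding in ("iso-8859-1", "cp1252", "utf-8"):
    print(encoding, can_encode(sample, encoding))
# iso-8859-1 False
# cp1252 True
# utf-8 True
```

Strict ISO-8859-1 has no curly quotes, which suggests the old data was effectively Windows-1252; UTF-8 can store them without loss.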