Page 1 of 1

Curly quotes garbage to UTF-8

Posted: Fri Nov 14, 2008 2:51 pm
by icesolid
My scripts have been submitting garbage iso-8859-1 MS Word curly quotes and such into a MySQL database. I need to convert all of the tables garbage iso-8859-1 curly quotes and other MS Word input into clean utf-8 compliant encoding.

Any ideas?

Re: Curly quotes garbage to UTF-8

Posted: Thu Nov 20, 2008 3:17 am
by batfastad
I had exactly the same problem recently. I wanted to prevent users from entering curly quotes, emdashes, elipses anywhere in my solution. There's more on this problem here http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

Code: Select all

// REPLACE ANY STUPID UTF8 CHARS
function utf8_char_replace(&$var) {
 
    $trans_table = array(
        chr(0xe2).chr(0x80).chr(0x9a) => '\'', //SINGLE LOW-9 QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x9e) => '"', //DOUBLE LOW-9 QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0xa6) => '...', //HORIZONTAL ELLIPSIS
        chr(0xe2).chr(0x80).chr(0x98) => '\'', //LEFT SINGLE QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x99) => '\'', //RIGHT SINGLE QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x9c) => '"', //LEFT DOUBLE QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x9d) => '"', //RIGHT DOUBLE QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x93) => '-', //EN DASH
        chr(0xe2).chr(0x80).chr(0x94) => '-' //EM DASH
    );
 
    foreach ($trans_table as $utf8_code => $replace) {
        $var = str_replace($utf8_code, $replace, $var);
    }
 
    return $var;
}
array_walk_recursive($_GET, 'utf8_char_replace');
array_walk_recursive($_POST, 'utf8_char_replace');
array_walk_recursive($_COOKIE, 'utf8_char_replace');
Add this in the top of your database "save" script, before you get the POST/GET data into variables and save into MySQL. It basically goes through $_GET, $_POST, $_COOKIE and replaces any annoying characters with their regular equivalents. So an en/em dash gets replaced with a regular dash. And curly quotes get replaced with their regular equivalents.

It took me a while to figure this out, but when the user pastes from word into the textarea in the browser, it's in the stupid windows iso-8859-1 charset.
But when the user presses submit, it was being converted (by the browser I guess) into UTF-8. So this function actually replaces the UTF-8 codes for curly quotes, with their simpler equivalents.
For a couple of days I was trying to replace the iso-8859-1 chars, and couldn't figure out why they weren't being replaced.
I used these tables to get the UTF8 values that I needed to replace: http://www.manderby.com/mandalex/a/ascii.php so you could add further values to the $trans_table array.

This function sits in my site-wide config file and gets run on every page, just after my fixing magic quotes global function. I use UTF-8 throughout my scripts and database.

Hope this helps you out

Cheers, B

Re: Curly quotes garbage to UTF-8

Posted: Thu Nov 20, 2008 9:18 am
by icesolid
Thanks for the tip. Not much information out there about this whole Word / Textarea issue. :banghead:

Re: Curly quotes garbage to UTF-8

Posted: Thu Nov 20, 2008 9:30 am
by batfastad
Ah I just re-read your post and you wanted to know how to replace it once it's already in the database :oops:

The only thing I would suggest is ripping through the records and replacing the offending characters in each field:

Code: Select all

UPDATE `table` SET `field1` = REPLACE(`field1`,'“','"'), `field2`=REPLACE... etc WHERE 1;
Although someone better at MySQL might have a quicker/more efficient way, but that's what I would try first.

HTH, B

Re: Curly quotes garbage to UTF-8

Posted: Thu Nov 20, 2008 10:06 am
by VladSun
It's already done:
viewtopic.php?f=2&t=91003