Curly quotes garbage to UTF-8

Questions about the MySQL, PostgreSQL, and most other databases, as well as using it with PHP can be asked here.

Moderator: General Moderators

Post Reply
icesolid
Forum Regular
Posts: 502
Joined: Mon May 06, 2002 9:36 pm
Location: Buffalo, NY

Curly quotes garbage to UTF-8

Post by icesolid »

My scripts have been submitting garbage iso-8859-1 MS Word curly quotes and such into a MySQL database. I need to convert all of the tables garbage iso-8859-1 curly quotes and other MS Word input into clean utf-8 compliant encoding.

Any ideas?
User avatar
batfastad
Forum Contributor
Posts: 433
Joined: Tue Mar 30, 2004 4:24 am
Location: London, UK

Re: Curly quotes garbage to UTF-8

Post by batfastad »

I had exactly the same problem recently. I wanted to prevent users from entering curly quotes, emdashes, elipses anywhere in my solution. There's more on this problem here http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

Code: Select all

// REPLACE ANY STUPID UTF8 CHARS
function utf8_char_replace(&$var) {
 
    $trans_table = array(
        chr(0xe2).chr(0x80).chr(0x9a) => '\'', //SINGLE LOW-9 QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x9e) => '"', //DOUBLE LOW-9 QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0xa6) => '...', //HORIZONTAL ELLIPSIS
        chr(0xe2).chr(0x80).chr(0x98) => '\'', //LEFT SINGLE QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x99) => '\'', //RIGHT SINGLE QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x9c) => '"', //LEFT DOUBLE QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x9d) => '"', //RIGHT DOUBLE QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x93) => '-', //EN DASH
        chr(0xe2).chr(0x80).chr(0x94) => '-' //EM DASH
    );
 
    foreach ($trans_table as $utf8_code => $replace) {
        $var = str_replace($utf8_code, $replace, $var);
    }
 
    return $var;
}
array_walk_recursive($_GET, 'utf8_char_replace');
array_walk_recursive($_POST, 'utf8_char_replace');
array_walk_recursive($_COOKIE, 'utf8_char_replace');
Add this in the top of your database "save" script, before you get the POST/GET data into variables and save into MySQL. It basically goes through $_GET, $_POST, $_COOKIE and replaces any annoying characters with their regular equivalents. So an en/em dash gets replaced with a regular dash. And curly quotes get replaced with their regular equivalents.

It took me a while to figure this out, but when the user pastes from word into the textarea in the browser, it's in the stupid windows iso-8859-1 charset.
But when the user presses submit, it was being converted (by the browser I guess) into UTF-8. So this function actually replaces the UTF-8 codes for curly quotes, with their simpler equivalents.
For a couple of days I was trying to replace the iso-8859-1 chars, and couldn't figure out why they weren't being replaced.
I used these tables to get the UTF8 values that I needed to replace: http://www.manderby.com/mandalex/a/ascii.php so you could add further values to the $trans_table array.

This function sits in my site-wide config file and gets run on every page, just after my fixing magic quotes global function. I use UTF-8 throughout my scripts and database.

Hope this helps you out

Cheers, B
icesolid
Forum Regular
Posts: 502
Joined: Mon May 06, 2002 9:36 pm
Location: Buffalo, NY

Re: Curly quotes garbage to UTF-8

Post by icesolid »

Thanks for the tip. Not much information out there about this whole Word / Textarea issue. :banghead:
User avatar
batfastad
Forum Contributor
Posts: 433
Joined: Tue Mar 30, 2004 4:24 am
Location: London, UK

Re: Curly quotes garbage to UTF-8

Post by batfastad »

Ah I just re-read your post and you wanted to know how to replace it once it's already in the database :oops:

The only thing I would suggest is ripping through the records and replacing the offending characters in each field:

Code: Select all

UPDATE `table` SET `field1` = REPLACE(`field1`,'“','"'), `field2`=REPLACE... etc WHERE 1;
Although someone better at MySQL might have a quicker/more efficient way, but that's what I would try first.

HTH, B
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Curly quotes garbage to UTF-8

Post by VladSun »

It's already done:
viewtopic.php?f=2&t=91003
There are 10 types of people in this world, those who understand binary and those who don't
Post Reply