My scripts have been submitting garbage iso-8859-1 MS Word curly quotes and such into a MySQL database. I need to convert all of the tables garbage iso-8859-1 curly quotes and other MS Word input into clean utf-8 compliant encoding.
Any ideas?
Curly quotes garbage to UTF-8
Moderator: General Moderators
Re: Curly quotes garbage to UTF-8
I had exactly the same problem recently. I wanted to prevent users from entering curly quotes, emdashes, elipses anywhere in my solution. There's more on this problem here http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
Add this in the top of your database "save" script, before you get the POST/GET data into variables and save into MySQL. It basically goes through $_GET, $_POST, $_COOKIE and replaces any annoying characters with their regular equivalents. So an en/em dash gets replaced with a regular dash. And curly quotes get replaced with their regular equivalents.
It took me a while to figure this out, but when the user pastes from word into the textarea in the browser, it's in the stupid windows iso-8859-1 charset.
But when the user presses submit, it was being converted (by the browser I guess) into UTF-8. So this function actually replaces the UTF-8 codes for curly quotes, with their simpler equivalents.
For a couple of days I was trying to replace the iso-8859-1 chars, and couldn't figure out why they weren't being replaced.
I used these tables to get the UTF8 values that I needed to replace: http://www.manderby.com/mandalex/a/ascii.php so you could add further values to the $trans_table array.
This function sits in my site-wide config file and gets run on every page, just after my fixing magic quotes global function. I use UTF-8 throughout my scripts and database.
Hope this helps you out
Cheers, B
Code: Select all
// REPLACE ANY STUPID UTF8 CHARS
function utf8_char_replace(&$var) {
$trans_table = array(
chr(0xe2).chr(0x80).chr(0x9a) => '\'', //SINGLE LOW-9 QUOTATION MARK
chr(0xe2).chr(0x80).chr(0x9e) => '"', //DOUBLE LOW-9 QUOTATION MARK
chr(0xe2).chr(0x80).chr(0xa6) => '...', //HORIZONTAL ELLIPSIS
chr(0xe2).chr(0x80).chr(0x98) => '\'', //LEFT SINGLE QUOTATION MARK
chr(0xe2).chr(0x80).chr(0x99) => '\'', //RIGHT SINGLE QUOTATION MARK
chr(0xe2).chr(0x80).chr(0x9c) => '"', //LEFT DOUBLE QUOTATION MARK
chr(0xe2).chr(0x80).chr(0x9d) => '"', //RIGHT DOUBLE QUOTATION MARK
chr(0xe2).chr(0x80).chr(0x93) => '-', //EN DASH
chr(0xe2).chr(0x80).chr(0x94) => '-' //EM DASH
);
foreach ($trans_table as $utf8_code => $replace) {
$var = str_replace($utf8_code, $replace, $var);
}
return $var;
}
array_walk_recursive($_GET, 'utf8_char_replace');
array_walk_recursive($_POST, 'utf8_char_replace');
array_walk_recursive($_COOKIE, 'utf8_char_replace');It took me a while to figure this out, but when the user pastes from word into the textarea in the browser, it's in the stupid windows iso-8859-1 charset.
But when the user presses submit, it was being converted (by the browser I guess) into UTF-8. So this function actually replaces the UTF-8 codes for curly quotes, with their simpler equivalents.
For a couple of days I was trying to replace the iso-8859-1 chars, and couldn't figure out why they weren't being replaced.
I used these tables to get the UTF8 values that I needed to replace: http://www.manderby.com/mandalex/a/ascii.php so you could add further values to the $trans_table array.
This function sits in my site-wide config file and gets run on every page, just after my fixing magic quotes global function. I use UTF-8 throughout my scripts and database.
Hope this helps you out
Cheers, B
Re: Curly quotes garbage to UTF-8
Thanks for the tip. Not much information out there about this whole Word / Textarea issue. 
Re: Curly quotes garbage to UTF-8
Ah I just re-read your post and you wanted to know how to replace it once it's already in the database 
The only thing I would suggest is ripping through the records and replacing the offending characters in each field:
Although someone better at MySQL might have a quicker/more efficient way, but that's what I would try first.
HTH, B
The only thing I would suggest is ripping through the records and replacing the offending characters in each field:
Code: Select all
UPDATE `table` SET `field1` = REPLACE(`field1`,'“','"'), `field2`=REPLACE... etc WHERE 1;HTH, B
Re: Curly quotes garbage to UTF-8
It's already done:
viewtopic.php?f=2&t=91003
viewtopic.php?f=2&t=91003
There are 10 types of people in this world, those who understand binary and those who don't