Wrong Character Input

Posted: Tue Feb 05, 2008 12:48 pm
by hawleyjr
When text that was copied from Word is inserted into a database, I'm getting the wrong characters stored in the database.

Code: Select all

nce’ to “your” com
I'm assuming this is because of the wrong character type being copied from Word. What is the best way to convert the characters to make sure they are correct?

I know I can use str_replace(), but I really don't want to test for every possible invalid character entered through Word.

Thanks guys.

Re: Wrong Character Input

Posted: Tue Feb 05, 2008 12:56 pm
by JAM
My thought is that MS likely has some sort of internal encoding for its documents (unlike OpenDoc), so I'm not sure there is a single function to use at all.

Re: Wrong Character Input

Posted: Tue Feb 05, 2008 12:59 pm
by hawleyjr
I was messing with the following, but I really don't want to play with Word anymore

Code: Select all

 
 $value = str_replace( array( '’','‘','“','”','…','©','®','™','—','–','%u2013',"\t",'%u2019' ), array( "'","'",'"','"','...','&copy;','®','™','-','-','-','   ',"'" ), $value );
 

Re: Wrong Character Input

Posted: Tue Feb 05, 2008 1:06 pm
by JAM
hawleyjr wrote:I was messing with the following, but I really don't want to play with Word anymore
:lol:

You could, if possible, try saving the file(s) as RTF and then reading them again. That might get you a different result. Not usable in the long run if you want to use other people's sources, though...

Re: Wrong Character Input

Posted: Tue Feb 05, 2008 1:08 pm
by Kieran Huggins
Word is the devil - avoid it at all costs! The characters it uses have always been a major headache.

Re: Wrong Character Input

Posted: Tue Feb 05, 2008 1:17 pm
by hawleyjr
Kieran Huggins wrote:Word is the devil - avoid it at all costs! The characters it uses have always been a major headache.
Yeah, Word does suck, but when you are dealing with customers and 99.9% of them use Word, you have no choice but to play nice.

Anyway, back on topic. Any ideas?

Re: Wrong Character Input

Posted: Tue Feb 05, 2008 1:28 pm
by Kieran Huggins
http://www.fckeditor.net supports "paste from Word with cleanup" - maybe it's worth looking at?

Re: Wrong Character Input

Posted: Tue Feb 05, 2008 1:54 pm
by Christopher
What character set are you specifying for your HTML doc? If it is UTF-8, you will be in for a lot of pain if you use MS docs as your content source. Switch to something like ISO-8859-1 or, if all else fails, Windows-1252.

Re: Wrong Character Input

Posted: Tue Feb 05, 2008 2:09 pm
by dml
Just taking this character sequence - “ ... That was at one time a correctly UTF-8-encoded left double quotation mark (U+201C) that was then processed somewhere using a single-byte encoding. It might be that it was misread on the way into the database (verify this by checking the exact bytes that have gone in), in which case you're going to have to fix the data in the tables. Or it might be that it's being misread on the way out of the database, in which case it might suffice to set the Content-Type header.

Code: Select all

 
// will display “
header("Content-Type: text/plain; charset=cp1252");
echo "\xe2\x80\x9c";
 

Code: Select all

 
// will display “
header("Content-Type: text/plain; charset=utf-8");
echo "\xe2\x80\x9c";
 

Code: Select all

 
// will also display “
header("Content-Type: text/plain; charset=cp1252");
echo "\x93";
 
” - this is a double right quote mark that's been mangled in the same way.
nce’ - don't know what this is
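For what it's worth, a sequence like this can be identified by mapping each visible character back to the single Windows-1252 byte it stands for and looking at the result. A minimal sketch, assuming the mbstring extension is available and that the text was mangled by reading UTF-8 as Windows-1252 (the function name is made up for illustration):

```php
<?php
// Hypothetical helper: hex dump of the bytes hiding behind a garbled sequence.
// Assumes $garbled is UTF-8 text showing "UTF-8 read as Windows-1252" damage.
function original_bytes_hex(string $garbled): string
{
    // Turn each visible character back into its single Windows-1252 byte.
    $bytes = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');
    return bin2hex($bytes);
}

// The characters U+00E2, U+20AC, U+2122 written as UTF-8 byte escapes,
// so the encoding of this source file doesn't matter.
echo original_bytes_hex("\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2"), "\n"; // e28099
```

e2 80 99 is the UTF-8 encoding of U+2019, the right single quotation mark, so under that assumption the sequence would be a curly apostrophe mangled the same way as the double quotes.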

It's likely that this can be fixed with some sequence of encoding conversions. Trying to str_replace could make things worse by garbling half a character in a multi-byte encoding.
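
One such conversion could look like the sketch below. It assumes the damage is exactly one round of "UTF-8 bytes read as Windows-1252 and saved again as UTF-8", and that the mbstring extension is loaded. The name repair_mojibake() is hypothetical, not an existing PHP function.

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 read as Windows-1252" damage.
function repair_mojibake(string $text): string
{
    // Re-encode the visible characters back to the single bytes they came from.
    $bytes = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

    // Keep the result only if those bytes form valid UTF-8. If they don't,
    // the input was probably not mangled this way, so return it untouched.
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $text;
}

// Garbled left double quote (U+00E2, U+20AC, U+0153 as UTF-8 byte escapes).
$fixed = repair_mojibake("\xC3\xA2\xE2\x82\xAC\xC5\x93");
echo bin2hex($fixed), "\n"; // e2809c, the UTF-8 bytes of U+201C
```

Because whole characters are converted, this avoids the half-a-character problem that str_replace can cause. It is not a complete fix: the right double quote case above has a third byte (0x9D) with no assigned character in Windows-1252, so that one may not survive the round trip, depending on how the text was mangled in the first place.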