Wrong Character Input

hawleyjr
BeerMod
Posts: 2170
Joined: Tue Jan 13, 2004 4:58 pm
Location: Jax FL & Spokane WA USA

Wrong Character Input

Post by hawleyjr »

When text copied from Word is inserted into a database, the wrong characters end up in the database.

Code: Select all

nce’ to “your” com
I'm assuming this is because of the wrong character type being copied from Word. What is the best way to convert the characters to make sure they are correct?

I know I can use str_replace(), but I really don't want to test every possible invalid character input through Word.

Thanks guys.
JAM
DevNet Resident
Posts: 2101
Joined: Fri Aug 08, 2003 6:53 pm
Location: Sweden

Re: Wrong Character Input

Post by JAM »

My thought is that MS likely has some sort of internal encoding for its documents (unlike OpenDoc), so I'm not sure there is a single function to use at all.
hawleyjr

Re: Wrong Character Input

Post by hawleyjr »

I was messing with the following, but I really don't want to play with Word anymore

Code: Select all

 
$value = str_replace ( array( '’','‘','“','”','…','©','®','™','—','–','%u2013',"\t",'%u2019' ), array( "'","'",'"','"','...','&copy','®','™','-','-','-','   ',"'" ), $value );
 
JAM

Re: Wrong Character Input

Post by JAM »

hawleyjr wrote:I was messing with the following, but I really don't want to play with Word anymore
:lol:

You could, if it's possible, try saving the file(s) as RTF and then reading them again. That might get you a different result. It's not usable in the long run if you want to use other people's sources, though...
Kieran Huggins
DevNet Master
Posts: 3635
Joined: Wed Dec 06, 2006 4:14 pm
Location: Toronto, Canada

Re: Wrong Character Input

Post by Kieran Huggins »

Word is the devil - avoid it at all costs! The characters it uses have always been a major headache.
hawleyjr

Re: Wrong Character Input

Post by hawleyjr »

Kieran Huggins wrote:Word is the devil - avoid it at all costs! The characters it uses have always been a major headache.
Yeah, Word does suck, but when you are dealing with customers and 99.9% of them use Word, you have no choice but to play nice.

Anyway, back on topic. Any ideas?
Kieran Huggins

Re: Wrong Character Input

Post by Kieran Huggins »

http://www.fckeditor.net supports "paste from Word with cleanup" - maybe it's worth looking at?
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Wrong Character Input

Post by Christopher »

What character set are you specifying for your HTML doc? If it is UTF-8 you will be in for a lot of pain if you use MS docs as your content source. Switch to something like ISO 8859-1 or if all else fails Windows-1252.
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: Wrong Character Input

Post by dml »

Just taking this character sequence: “. That was at one time a correctly UTF-8-encoded left double quotation mark (U+201C, bytes E2 80 9C) that was processed somewhere using a single-byte encoding. It might have been misread on the way into the database (verify this by checking the exact bytes that have gone in), in which case you're going to have to fix the data in the tables. Or it might be misread on the way out of the database, in which case it might suffice to set the Content-Type header.

Code: Select all

 
// will display “
header("Content-Type: text/plain; charset=cp1252");
echo "\xe2\x80\x9c";
 

Code: Select all

 
// will display “ (the correct left double quotation mark)
header("Content-Type: text/plain; charset=utf-8");
echo "\xe2\x80\x9c";
 

Code: Select all

 
// will also display “
header("Content-Type: text/plain; charset=cp1252");
echo "\x93";
 
” - this is a double right quote mark that's been mangled in the same way.
nce’ - don't know what this is
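
To check the exact bytes, something along these lines might help. It is only a sketch: it assumes the mbstring extension is loaded, and the variable names are made up for illustration. It dumps the bytes with bin2hex() and reproduces the mangling by reading correct UTF-8 bytes as Windows-1252.

```php
<?php
// Sketch only: assumes the mbstring extension is available.

// U+201C (left double quotation mark), correctly encoded as UTF-8.
$original = "\xE2\x80\x9C";

// Misread each byte as a Windows-1252 character, then re-encode as UTF-8.
// This is the kind of step that turns one quote mark into three characters.
$mangled = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

echo bin2hex($original), "\n"; // e2809c
echo bin2hex($mangled), "\n";  // c3a2e282acc593
```

If the bytes stored in the table look like the second dump, the data was damaged on the way in. If they look like the first, the data is fine and the problem is on the way out.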

It's likely that this can be fixed with some sequence of encoding conversions. Trying to str_replace could make things worse by garbling half a character in a multi-byte encoding.
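
One possible conversion sequence, as a hedged sketch rather than a definitive fix: it assumes PHP 7 or later with the mbstring extension, assumes the data was mangled exactly once by reading UTF-8 bytes as Windows-1252, and uses a hypothetical helper name, repair_mojibake(), that is not from any library. It converts the characters back to single bytes and keeps the result only if it round-trips and is valid UTF-8.

```php
<?php
// Hypothetical helper (not a library function): undo one round of
// "UTF-8 bytes read as Windows-1252" mangling. Assumes mbstring.
function repair_mojibake(string $text): string
{
    // Map each character back to the single byte it was misread from.
    $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

    // Re-apply the mangling; if we do not get the input back, some
    // character had no Windows-1252 equivalent and was substituted.
    $roundTrip = mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252');

    // Accept the repair only if it is lossless and yields valid UTF-8.
    if ($roundTrip !== $text || !mb_check_encoding($candidate, 'UTF-8')) {
        return $text;
    }
    return $candidate;
}

// "nce" followed by a mangled U+2019 (right single quotation mark).
$mangled = "nce\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2";
$fixed = repair_mojibake($mangled);

echo bin2hex($fixed), "\n"; // 6e6365e28099
```

Text that was never mangled, such as plain ASCII or a correctly encoded "é", fails one of the two checks or converts to itself, so it comes back unchanged. Sequences involving bytes that Windows-1252 leaves undefined (such as 0x9D in the right double quote) may not round-trip, in which case the sketch leaves the text untouched.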