
form data encoding problem...

Posted: Mon Sep 20, 2004 5:34 pm
by newmember
my system: winxp + apache 1.3 + php 4.3.5, tested with IE and firefox

i'm receiving form data which might contain characters from different languages and saving it to a file...
recently i ran into a problem...

my form has: accept-charset="utf-8"

i'll take for example two completely different character sets:
hebrew and russian...

first case: i write ONLY hebrew in the form
-when i check the file, i see that these characters have become russian characters...(not good :? )

second case: i write hebrew and russian in the form
-when i check the file i see both russian and hebrew characters (that is, everything is as it should be)

i did the same test with firefox and in both cases the file looks as it should... the hebrew is there alongside the russian

(it is probably related to how IE encodes form data before it sends it to server...
if that's the cause.. i don't know how to solve this :? )

can anyone please help me on this?
thanks

Posted: Tue Sep 21, 2004 11:13 am
by newmember
i still don't know how to solve this... :?
here is what i know for now:

i did additional checks under same conditions:

i enter exactly the same text(in hebrew)...
* if i use ie to submit the form, the file size is 25 bytes, which is exactly one byte per character... so it is not saved as utf-8.
* if i use firefox to submit the form, the file size is 36 bytes, and when i open it i see hebrew text...

so i think maybe when ie sees only hebrew and english text in the form, it encodes it as ISO-8859-hebrew_charset... but firefox always encodes the data as utf-8...

i tried to run utf8_encode() on the input, thinking maybe php holds the string in ISO-8859-hebrew_charset encoding, but then things go completely wrong...
actually utf8_encode() can only handle ISO-8859-1
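
Since utf8_encode() only understands ISO-8859-1, a different conversion call is needed when the bytes arrive in a Hebrew single-byte encoding. The sketch below uses iconv() and assumes the source charset is windows-1255, which the thread never confirms; the helper name hebrew_bytes_to_utf8() is made up for illustration. It is a workaround for a known source charset, not a fix for the browser behaviour itself.

```php
<?php
// Assumption: the submitted bytes are windows-1255 (Hebrew).
// The thread only says "ISO-8859-hebrew_charset", so adjust as needed.
function hebrew_bytes_to_utf8($bytes, $sourceCharset = 'windows-1255')
{
    // iconv() returns false when the input cannot be converted.
    return iconv($sourceCharset, 'UTF-8', $bytes);
}

// Four Hebrew letters (shin, lamed, vav, final mem), one byte each in windows-1255.
$raw  = "\xF9\xEC\xE5\xED";
$utf8 = hebrew_bytes_to_utf8($raw);

// Hebrew letters take two bytes each in UTF-8, so the length doubles.
echo strlen($raw), " bytes in, ", strlen($utf8), " bytes out\n";
```

The doubling for Hebrew letters, with ASCII staying at one byte, is the same effect as the 25 vs 36 byte file sizes reported above.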

and another test...
i went to the php.net manual and looked through the user comments. i found a function there, seems_utf8(), which checks whether a string is UTF-8 or not...
so i ran this function on the form input that ie sends... the results were:
* if i write hebrew and russian then seems_utf8() returns true... so that means ie sent utf-8 encoded data.
* if i write ONLY hebrew then seems_utf8() returns false, meaning that the data arrived not as UTF-8 but in some other encoding.

while firefox ALWAYS sends utf-8 encoded data...
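
A check of this kind can be sketched as follows. This is not the seems_utf8() code from the manual comments; looks_like_utf8() is a hypothetical stand-in that relies on PCRE's /u modifier, and the single-byte sample again assumes windows-1255.

```php
<?php
// Hypothetical stand-in for the seems_utf8() helper mentioned above.
function looks_like_utf8($bytes)
{
    // With the /u modifier PCRE validates the subject as UTF-8,
    // so the empty pattern only matches well-formed input.
    return preg_match('//u', $bytes) === 1;
}

$hebrewUtf8       = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D"; // four Hebrew letters as UTF-8
$hebrewSingleByte = "\xF9\xEC\xE5\xED";                 // same letters, assuming windows-1255

var_dump(looks_like_utf8($hebrewUtf8));       // bool(true)
var_dump(looks_like_utf8($hebrewSingleByte)); // bool(false)
```

Plain ASCII text also passes this check, since ASCII is valid UTF-8, so a true result only says the bytes are well-formed, not which encoding the browser intended.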

so i'm really lost here... :?

it looks like the php script is at the browser's mercy...!!!

also, i talked on a php channel on mirc and someone there said that it's practically impossible to make multilingual pages with php... but i'm not asking for much... i only need utf-8 support

meanwhile i thought of two solutions:
* the first is to force the browser to return the data as utf-8... but i'm not sure it is really possible...

* and the second is to put a hidden input element with hebrew and russian characters inside the form (but with this approach i will have to enter a character for each language).
i haven't tested the second solution but i think it will almost certainly work...

but these are all workarounds...

so maybe someone has encountered similar difficulties and knows how to overcome this problem?

Posted: Tue Sep 21, 2004 11:39 am
by feyd
maybe if you were in a different character set, it'd use unicode entities... I know whenever I paste characters outside my character set, I get unicode entities..

Posted: Tue Sep 21, 2004 11:49 am
by newmember
php's htmlentities() doesn't help either (it was the first function i tried)
htmlentities() doesn't support the hebrew codepage, as you can check in the manual.
and as i described in earlier posts, ie sends the data encoded in the hebrew codepage from the start.
(i even printed the translation table with get_html_translation_table() just to see what is in there... and no trace of hebrew, of course)

Posted: Tue Sep 21, 2004 11:57 am
by Weirdan
MSDN wrote:
Syntax

HTML <FORM ACCEPTCHARSET = sChar... >
Scripting FORM.acceptCharset(v) [ = sChar ]

Possible Values

sChar
String that specifies or receives a space- and/or comma-delimited list of charset values.

UTF-8
If the user enters characters that are not in the character set of the document containing the form, the UTF-8 character set will be used. UTF-8 is the preferred format for multilingual text.

Remarks

If this attribute is not specified, the form will be submitted in the character encoding specified for the document. If the form includes characters outside the character set specified for the document, Microsoft Internet Explorer will attempt to determine an appropriate character set. If an appropriate character set cannot be determined, then the characters outside of the character set will be encoded as an HTML numeric character reference. For more information on character sets and numerical character references, see HTML Character Sets.

try setting the encoding of the document to UTF-8 (e.g. via <meta http-equiv="Content-type" content="text/html; charset=UTF-8" />)
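
A minimal sketch of a page that declares UTF-8 in both places, the meta tag suggested above and the HTTP header, might look like this. The header() call is an addition beyond what the post suggests, and "save.php", the field name "text" and the function render_form_page() are placeholders, not names from this thread.

```php
<?php
// Sketch only: "save.php" and the field name "text" are placeholders.
function render_form_page()
{
    return <<<HTML
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=UTF-8" />
</head>
<body>
<form method="post" action="save.php" accept-charset="utf-8">
<textarea name="text"></textarea>
<input type="submit" />
</form>
</body>
</html>
HTML;
}

// Declare the same charset in the HTTP header, before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo render_form_page();
```

With the document itself declared as UTF-8, the MSDN remarks quoted above suggest IE has no reason to fall back to another character set when submitting the form.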

Posted: Tue Sep 21, 2004 12:03 pm
by feyd
heh, I guessed right :P

Posted: Tue Sep 21, 2004 12:18 pm
by newmember
i will check right now
the fact is that i don't specify a character encoding for the document; i thought i'd leave these details for the end
i really hope it is a real solution :)

Posted: Tue Sep 21, 2004 12:30 pm
by newmember
:D
this works n :D w
so simple and basically my fault... at least now i know why setting the character encoding for the document is important...

thank you very m :D ch

Posted: Tue Sep 21, 2004 12:33 pm
by Weirdan
you're welcome.

btw, msdn.microsoft.com is a great site where many IE quirks are documented. You should consider bookmarking it if you're serious about developing for IE.

Posted: Tue Sep 21, 2004 12:57 pm
by newmember
i have this for quite some time... :D
