Re: Find what charset forms a form has been sent under
Posted: Thu Feb 26, 2009 1:05 pm
Read this: http://www.phpwact.org/php/i18n/charsets
lkjkorn19 wrote: Using the accept-charset attribute in the <form> tag of my test document forced the browser to send it to the value of that attribute, therefore my problem of being able to send data to my page under various encodings is solved. (Tested in FF3 & IE7 on WinXP, this presumably applies for all OS's though(?).)

This might be the most important thing you can do. Do everything you can to instruct the client to send data in UTF-8. The "Declaring UTF-8" section of the page I linked explains how to do this.
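As a rough illustration of the accept-charset approach, here is a minimal sketch. The function name build_utf8_form(), the field name "comment" and the URL in the comment are made up for this example and are not from the linked page. Serving the page itself as UTF-8 is only hinted at in a comment; the "Declaring UTF-8" section covers that part properly.

```php
<?php
// Hypothetical helper (not from the linked page): returns a form whose
// accept-charset attribute asks the browser to submit its data as UTF-8.
function build_utf8_form(string $action): string
{
    // Escape the action URL so it is safe inside an HTML attribute.
    $safeAction = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');

    return '<form method="post" action="' . $safeAction . '" accept-charset="UTF-8">'
         . '<input type="text" name="comment">'
         . '<input type="submit" value="Send">'
         . '</form>';
}

// Typical use in a page: declare the page's own charset first, then print the form.
// header('Content-Type: text/html; charset=utf-8');
// echo build_utf8_form('/submit.php');
```

Keep in mind this is only a request to the browser. The server still cannot be sure what actually arrives.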
lkjkorn19 wrote: Technically, if data is being sent to the server under another encoding; is there any way to find out which encoding it is sent under? The document you have referred to, as well as a couple of documents to which are referred in said document, state that it is important to know which encoding data is in, before being able to start something useful with it. That's great and all, but it's hard to do so when PHP does not allow it.

Going back to your first item, if you handle that correctly this becomes less of a concern. The best you can do is tell the client's browser that you want UTF-8 and hope it complies. If you really don't trust the character encoding of your data, you can write code that inspects the raw bytes and tries to guess the encoding, but that's no small task, and it will most likely rely on heuristics that can guess wrong. Having said that, it's still a good idea to check the input you receive to make sure it looks like well-formed UTF-8. For that, re-read the "Interfacing with systems using other Charsets" and "Checking UTF-8 for Well Formedness" sections of the page I linked to before.
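One common way to do the well-formedness check is sketched below. It is not necessarily the method the linked page uses. The helper name is_well_formed_utf8() and the "comment" field are hypothetical. It relies on the PCRE "u" modifier, which makes preg_match() return false when the subject string is not valid UTF-8.

```php
<?php
// Hypothetical helper: true if $input is well-formed UTF-8, false otherwise.
// With the "u" modifier, preg_match() fails (returns false) on a subject
// that is not valid UTF-8.
function is_well_formed_utf8(string $input): bool
{
    // The empty pattern matches any valid subject, so a result of 1 means "valid".
    return preg_match('//u', $input) === 1;
}

// Example: reject a (hypothetical) form field that is not valid UTF-8.
// if (!is_well_formed_utf8($_POST['comment'] ?? '')) { /* reject or ask again */ }
```

A passing check only says the bytes are valid UTF-8. It does not prove which encoding the sender intended, so it complements the accept-charset step and does not replace it.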
lkjkorn19 wrote: Interesting paragraph(s) about the functions utf8_encode & utf8_decode. Although the name indeed is misleading, the description that PHP.net provides is fairly straightforward. (Much to my embarrassment,) I didn't know that they only worked for ISO-8859-1 - UTF-8 conversion, respectively vice versa.
Even more interesting, the function iconv(). Seems like this could solve my problem in general, if the question in the second item of this list can be answered. Any downfalls of that function which I should watch out for? (Apart from the ones listed in the comments of PHP.net's iconv() manual, of course.)

I avoid iconv() as much as possible. I prefer to let the browser handle the character encoding, based on what you do in your first item. There are times when you need iconv(), but only when you know without a doubt what the input character encoding is. For the problem you describe, you do not know the input character encoding, so iconv() is not a good solution. When you do use iconv(), be advised that it is locale-dependent. Also, if you are converting between ISO-8859-1 and UTF-8, always use utf8_encode() and utf8_decode(), because iconv() might not do it correctly depending on your locale.
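For the case where the input encoding really is known, a conversion could look roughly like the sketch below. The helper name convert_known_charset_to_utf8() is made up, ISO-8859-15 is just an example source charset, and the code assumes the iconv extension is available. The point is to pass the source charset explicitly and to treat a failed conversion as an error.

```php
<?php
// Hypothetical helper: convert $input to UTF-8 from a charset that is known
// for certain (for example, fixed by the documentation of the sending system).
function convert_known_charset_to_utf8(string $input, string $fromCharset): string
{
    // @ silences the notice or warning iconv() raises on failure;
    // the false return value is checked instead.
    $converted = @iconv($fromCharset, 'UTF-8', $input);

    if ($converted === false) {
        throw new RuntimeException("Could not convert input from $fromCharset to UTF-8");
    }

    return $converted;
}

// Example: byte 0xA4 is the euro sign in ISO-8859-15.
// $utf8 = convert_known_charset_to_utf8("\xA4", 'ISO-8859-15');
```

This does nothing to discover an unknown encoding. If the source charset is a guess, the output will be a guess too.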