Posted: Thu Aug 24, 2006 8:38 pm
Quote: "OK, but htmlentities is more exhaustive, so I have been using it on that premise."
No it isn't. The two have different functions. htmlentities converts every character that has a named entity. htmlspecialchars converts only the characters that are special in HTML (&, <, > and quotes). Given the proper input and output character encodings, htmlspecialchars is highly effective, while htmlentities becomes redundant. This is because most of those entities exist to represent characters that are missing from the character set you are currently using. UTF-8 supports all of those characters, so they can be output directly into the HTML.
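To make the difference concrete, here is a small sketch that runs both functions over the same UTF-8 string. The sample string and variable names are my own, not from anyone's code in this thread.

```php
<?php
// "café <b>" written with explicit UTF-8 bytes so the file encoding doesn't matter.
$text = "caf\xC3\xA9 <b>";

// Escapes only the HTML-special characters; the é stays as raw UTF-8.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Also turns the é into a named entity, which is unnecessary on a UTF-8 page.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";   // café &lt;b&gt;
echo $entities, "\n";  // caf&eacute; &lt;b&gt;
```

Both outputs are equally safe against markup injection; the second one is just longer.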
Quote: "In fact at the moment OsisForms does not support Unicode, I have made no attempt for it to do so considering that it would make the mb_string extension a requirement and PHP 6 will solve this anyway."
Sorry to be blunt, but that's quite a naive approach. Mbstring does a certain job and does it well, namely providing multibyte-aware string functions, but it's not necessary for UTF-8 support. HTMLPurifier, for example, supports only UTF-8 and doesn't require mbstring at all. Handling UTF-8 strings is quite simple, because UTF-8 is built so that an ASCII character can never be confused with a byte inside a multibyte character. If you need complex text manipulation, there are loads of stable pure PHP libraries that do things like case conversion for you.
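A rough sketch of what that property buys you without mbstring: every byte of a multibyte UTF-8 sequence is 0x80 or higher, so byte-oriented functions are safe when the delimiter is ASCII, and characters can be counted by skipping continuation bytes (0x80 to 0xBF). The function name utf8_length() is made up for this example, and it assumes the input is already well-formed UTF-8.

```php
<?php
// Hypothetical helper: count characters in a well-formed UTF-8 string
// by counting every byte that is NOT a continuation byte (0x80-0xBF).
function utf8_length($s)
{
    return preg_match_all('/[^\x80-\xBF]/', $s, $unused);
}

// "naïve,café" with explicit UTF-8 bytes.
$csv = "na\xC3\xAFve,caf\xC3\xA9";

// Splitting on an ASCII comma cannot cut a multibyte character in half,
// because no byte inside a multibyte sequence is below 0x80.
$parts = explode(',', $csv);

echo strlen($parts[0]), "\n";      // 6 (bytes)
echo utf8_length($parts[0]), "\n"; // 5 (characters)
```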
Furthermore, you should be careful not to confuse Unicode with UTF-8. Unicode is a standard that defines the character set; UTF-8 is one encoding of it. Unicode can actually be encoded in different ways: UTF-16, Punycode, etc.
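As a quick illustration of one code point with two encodings, here is the euro sign (U+20AC) as UTF-8 and as UTF-16. This sketch assumes the iconv extension is available; the variable names are mine.

```php
<?php
// U+20AC EURO SIGN, written as its three UTF-8 bytes.
$utf8 = "\xE2\x82\xAC";

// Same Unicode character, re-encoded as big-endian UTF-16.
$utf16 = iconv('UTF-8', 'UTF-16BE', $utf8);

echo bin2hex($utf8), "\n";  // e282ac
echo bin2hex($utf16), "\n"; // 20ac
```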
In short, supporting UTF-8 comes down to these steps:
1. Make sure the page sends header('Content-Type: text/html; charset=utf-8'); and includes the corresponding meta tag
2. Pass all input strings through a UTF-8 parser (iconv, mbstring, or pure PHP) to ensure that they are well-formed and contain no non-SGML code points
3. Escape all output data with htmlspecialchars() set to UTF-8 encoding
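The three steps could be sketched roughly like this. The function names clean_utf8() and escape_html() are invented for the example, the well-formedness check leans on PCRE's /u modifier rather than a dedicated parser, and dropping ill-formed input entirely is just one possible policy. The stripped ranges are the ones HTML 4's SGML declaration leaves out (U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, U+007F to U+009F).

```php
<?php
// Step 1: declare the encoding (the meta tag in the page should match).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Step 2 (hypothetical helper): reject ill-formed UTF-8, then strip
// non-SGML code points.
function clean_utf8($input)
{
    // With the /u modifier, PCRE refuses subjects that are not valid UTF-8.
    if (preg_match('//u', $input) !== 1) {
        return ''; // assumption: ill-formed input is discarded outright
    }
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x{9F}]/u', '', $input);
}

// Step 3 (hypothetical helper): escape for output as UTF-8.
function escape_html($input)
{
    return htmlspecialchars(clean_utf8($input), ENT_QUOTES, 'UTF-8');
}

echo escape_html("<b>a\x0Bb</b>"), "\n"; // &lt;b&gt;ab&lt;/b&gt;
```

A real parser such as the one in HTMLPurifier may repair bad sequences instead of throwing the whole string away; treat the above as a starting point only.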
Quote: "Are you saying that even if I specify the character encoding in htmlentities() I am still vulnerable to XSS?"
Probably not. However, it would be trivially easy to make the page stop validating. You must ensure that non-SGML code points are removed from the string.