Page 2 of 2

Posted: Thu Aug 24, 2006 8:38 pm
by Ambush Commander
OK, but htmlentities is more exhaustive, so I have been using it on that premise.
No it isn't. The two have different functions. htmlentities is all about converting everything possible into entities. htmlspecialchars is all about converting just what's needed into HTML. Given the proper output and input character encodings, htmlspecialchars is highly effective, while htmlentities actually becomes redundant and useless. This is because many of those numeric entities are meant to represent characters not in the character set you are currently using. UTF-8 supports all those characters, so they can be output directly into the HTML.
In fact, at the moment OsisForms does not support Unicode. I have made no attempt to support it, considering that it would make the mbstring extension a requirement, and PHP 6 will solve this anyway.
Sorry to be blunt, but that's quite a naive approach. Mbstring does a certain job and does it well, namely multibyte-sensitive string functions, but it's not necessary for UTF-8 support. HTMLPurifier, for example, supports only UTF-8, and doesn't require mbstring at all. Handling UTF-8 strings is quite simple because UTF-8 is built in a way that you'll never confuse a character with the internals of a multibyte character. If you need complex text manipulation, there are loads of stable pure-PHP libraries to do things like case conversion for you.

Furthermore, you should be careful not to confuse Unicode with UTF-8. Unicode is a standard, UTF-8 is an encoding/character set. Unicode can actually be encoded in different ways: UTF-16, punycode, etc.

In short, supporting UTF-8 comes down to these steps:

1. Make sure the HTML sends out header('Content-Type: text/html; charset=utf-8'); and the corresponding meta tag
2. Pass all input strings through a UTF-8 parser (iconv, mbstring, or pure PHP) to ensure that they are well-formed and contain no non-SGML code points
3. Escape all data with htmlspecialchars() set to UTF-8 encoding
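Putting those three steps together, a minimal sketch (the 'comment' field name is just for illustration; the //IGNORE flag on iconv's output encoding drops malformed byte sequences):

```php
<?php
// Step 1: declare the encoding in the HTTP header (mirror it in the meta tag)
header('Content-Type: text/html; charset=utf-8');

// Step 2: make sure the input is well-formed UTF-8; //IGNORE on the
// *output* encoding tells iconv to drop any invalid byte sequences
$comment = isset($_POST['comment']) ? $_POST['comment'] : '';
$comment = iconv('UTF-8', 'UTF-8//IGNORE', $comment);

// Step 3: escape on output, passing the encoding explicitly
echo htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');
```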
Are you saying that even if I specify the character encoding in htmlentities() I am still vulnerable to XSS?
Probably not. However, it would be trivially easy to cause the page to stop validating. You must ensure that non-SGML code-points are removed from the string.

Posted: Thu Aug 24, 2006 9:12 pm
by Ollie Saunders
htmlspecialchars is highly effective, while htmlentities actually becomes redundant and useless. This is because many of those numeric entities are meant to represent characters not in the character set you were currently using. UTF-8 supports all those characters, so they can be output directly into the HTML.
Congratulations, I am convinced :)
Sorry to be blunt
Be my guest.
Mbstring does a certain job and does it well, that is, multibyte sensitive string functions, but it's not necessary for UTF-8 support.
As I understand it, UTF-8 is normal 8-bit ASCII characters with the added ability to insert characters of any byte length. So you can have a 5-byte string that only contains 4 characters. I thought that, seeing as all the PHP string functions rely on the number of bytes equalling the number of characters, a lot of them would be UTF-8 incompatible.
HTMLPurifier, for example, supports only UTF-8
I'm starting to think I should go that way.
Passing all input strings through a UTF-8 parser (iconv, mbstring, or pure PHP)
Could you provide examples of these?
to ensure that it's well-formed and that there are no non-SGML codepoints in them
What are they?

Unicode is one of my major fuzzy areas. Seeing as you've come this far perhaps you would be patient enough to answer a whole load more questions.

Posted: Fri Aug 25, 2006 12:49 am
by matthijs
Unicode is one of my major fuzzy areas
Count me in as well :?

That was one thing that caught my attention after reading both: Chris' book uses/recommends htmlentities, while Ilia talks about htmlspecialchars without even mentioning htmlentities. I have also read in several places how htmlentities would be "better, safer" than htmlspecialchars. But maybe we can start another thread about this (or kick myself up the butt and start wading through those "light-reading" Unicode articles..)

Posted: Fri Aug 25, 2006 3:54 am
by sike
Ambush Commander wrote:You must ensure that non-SGML code-points are removed from the string.
care to explain what a non-SGML code point is? never heard of that (:

Posted: Fri Aug 25, 2006 6:34 am
by Ambush Commander
As I understand it, UTF-8 is normal 8-bit ASCII characters with the added ability to insert characters of any byte length.
While UTF-8 is backwards compatible with ASCII (valid ASCII text is valid UTF-8 text), UTF-8 is not just an "extension" of ASCII. It is a whole new character encoding.
So you can have a 5-byte string that only contains 4 characters. I thought that, seeing as all the PHP string functions rely on the number of bytes equalling the number of characters, a lot of them would be UTF-8 incompatible.
Correct. However, PHP has a nice feature where it treats all strings as binary. This means that as long as you are looking for specific characters, this will never be a problem: you will never confuse a byte inside a multibyte sequence with a full character.
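For instance, every byte of a multibyte UTF-8 sequence has its high bit set, so searching for an ASCII byte like '<' can never land in the middle of one (a small illustrative sketch):

```php
<?php
// 'é' and 'ö' are two-byte sequences, but every one of their bytes is
// >= 0x80, so the ASCII byte '<' only ever matches a real '<' character.
$html = 'héllo <b>wörld</b>';
$pos  = strpos($html, '<');   // a byte offset, but always a character boundary
echo substr($html, $pos);     // prints: <b>wörld</b>
```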
Could you provide examples of these
My favored approach is a modified UTF8toUnicode function; you can see it here (scroll down to cleanUTF8). http://hp.jpsband.org/svnroot/htmlpurif ... /Lexer.php

With iconv, it would look like:

Code: Select all

// the //IGNORE suffix goes on the *output* encoding and drops invalid sequences
$string = iconv('UTF-8', 'UTF-8//IGNORE', $string);
PCRE can check if a string is valid UTF-8 using the "u" modifier (although it can't be told to ignore invalid bytes, so we'll disregard it here).
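For completeness, that PCRE check might look like this (a sketch; an empty pattern with the u modifier acts as a pure validity probe):

```php
<?php
// preg_match() returns false (not 0) when the subject is not valid UTF-8
// and the 'u' modifier is set, so an empty pattern works as a validity test.
function isValidUTF8($str)
{
    return preg_match('//u', $str) === 1;
}
```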
What are they?
Code points 0 to 31 and 127 to 159. These are basically control characters. Note, however, that these are code points, NOT bytes: you can strip a byte like "\0" directly, but code points 128 and greater are encoded as multibyte sequences in UTF-8, so a naive byte-level replacement won't catch them.
care to explain what a non-SGML code point is? never heard of that (:
See above. To be more general, though, a non-SGML code point is a character that SGML does not allow, such as a null byte or a vertical tab.
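A sketch of stripping those ranges with a Unicode-aware regex (tab, newline and carriage return are kept here, since SGML does allow those three control characters):

```php
<?php
// Strip code points 0-31 (except tab, LF, CR), 127, and 128-159.
// With the 'u' modifier, \x{..} ranges match code points, not raw bytes,
// so the C1 range U+0080-U+009F is handled correctly in UTF-8.
function stripNonSGML($str)
{
    return preg_replace(
        '/[\x{0}-\x{8}\x{B}\x{C}\x{E}-\x{1F}\x{7F}-\x{9F}]/u',
        '',
        $str
    );
}
```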

Posted: Fri Aug 25, 2006 8:15 am
by Ollie Saunders
UTF-8 is built in a way that you'll never confuse a character with the internals of a multibyte character
How?
In short, supporting UTF-8 comes down to these steps:

1. Make sure the HTML sends out header('Content-Type: text/html; charset=utf-8'); and the corresponding meta tag
2. Pass all input strings through a UTF-8 parser (iconv, mbstring, or pure PHP) to ensure that they are well-formed and contain no non-SGML code points
3. Escape all data with htmlspecialchars() set to UTF-8 encoding
I'm doing 1 and 3 already. But what about the fact that strlen() returns the wrong number and substr() cuts in the wrong places? Don't you need libraries/mbstring for that (http://phputf8.sourceforge.net/ possibly)?

Perhaps I need a book on this.

Posted: Fri Aug 25, 2006 1:41 pm
by Ambush Commander
ASCII is a 7-bit code. It uses seven bits to define a character, and doesn't define what the eighth bit (the one that makes it a byte) does.

The ISO-8859 standards use the eighth bit to add all sorts of interesting characters. However, since they are limited to the byte range 160 to 255, that's obviously insufficient for ideographic languages like Chinese, or even just for unifying several languages.

Put bluntly, UTF-8 uses the eighth bit to extend the number of bytes in a character. http://en.wikipedia.org/wiki/UTF-8 gives a more in-depth explanation.

Now, to answer your question, say I have a UTF-8 string. There are lots of types of processing I can do to this string, with varying levels of success.

strlen() will return the number of bytes in the string, not necessarily the number of characters. I must stress, however, that in some cases it doesn't matter. There are workarounds (my favorite is strlen(utf8_decode($str)): utf8_decode converts every non-ASCII multibyte character into a single question mark, so the byte count ends up equal to the character count).

strpos() will always work, as long as you're dealing with a well-formed string.

substr() may or may not work depending on what you feed it. If you say substr($str, 0, 5), there's a chance that a multibyte character straddles position 5 and gets truncated (turning into an invalid byte sequence). However, if you feed it offsets from, say, strpos(), it will work fine. substr() treats the string as binary data: it deals in bytes, not characters. As the saying goes: garbage in, garbage out.

strtoupper() and friends won't work (well, maybe if you fiddle with the locale a little, they may, but I wouldn't trust it). However, lots of PHP applications and libraries have been written for this purpose: essentially, it's just a lookup table of lowercase to uppercase (and vice-versa). You should have no trouble finding a pure-PHP solution.

Considering what you are making, which is a form generation framework, you won't be doing any heavy string manipulation, esp. the type that requires linguistic knowledge. Switching to UTF-8 should be painless.

Posted: Fri Aug 25, 2006 2:35 pm
by Ollie Saunders
Thanks AC, I'm going to post back with a big post in a minute. Don't look so scared :)

Oh btw these are very good:
http://wiki.silverorange.com/UTF-8_Notes
http://www.phpwact.org/php/i18n/utf-8

Posted: Fri Aug 25, 2006 5:35 pm
by Ollie Saunders
OK, the long post probably isn't coming, so dw.
Considering what you are making, which is a form generation framework, you won't be doing any heavy string manipulation, esp. the type that requires linguistic knowledge. Switching to UTF-8 should be painless.
In general you are right, but in my filtering class there are a couple of functions that could be painful. And it's also a problem because there is a lot of code: 41 files and 5,520 lines (I wrote a script to find that out :))

How do you manage the problem of using mb_substr() when mbstring is available and using something else when it is not on a library wide level?
So far I've added utf8_decode() to strlen() calls where necessary and that hasn't taken too long, but I'm thinking I may have to do something like this:

Code: Select all

<?php
class UTF8
{
    protected static $_hasMbStr;

    public static function init()
    {
        self::$_hasMbStr = extension_loaded('mbstring');
    }

    public static function substr($str, $start, $len)
    {
        // lazy initialisation, so forgetting to call init() is harmless
        if (self::$_hasMbStr === null) {
            self::init();
        }
        if (self::$_hasMbStr) {
            return mb_substr($str, $start, $len);
        } else {
            // pure-PHP fallback implementation goes here
        }
    }
    // ...and so on for each string function you need
}

$newMultibyte = UTF8::substr('some multibyte string', 4, -3);
OK, I read all of the Wikipedia article on UTF-8 and things are starting to make more sense now. Thanks for all your help so far, AC; this is really tricky stuff and I couldn't do it without you. So, of course, I have more questions xD
  • Do you have to define the character sets that are going to be used, like you do with character encoding? Guessing no, but there seem to be plenty of opportunities for you to do so.
  • Do you think it is wise to make a library UTF-8 only? Why did you make that decision for yours?
  • I know what you mean when you say checking for well-formedness on all input, but this sounds like one hell of a lot of effort and processing. How did you manage it whilst minimising duplication and avoiding double-checks?
  • Can multibyte characters be used in variable, function or class names? Should I handle this? A lot of my code is quite reflective, so this could be an issue; for instance, where the library user has to supply a callback, should that be tested for well-formedness? I'm guessing yes, and if that is the case it could be a really large problem.
  • Can I pilfer cleanUTF8()? :P
    This might be an alternative actually, but I like you better than Harry Fuecks, even if he does have a great name.

Posted: Sat Aug 26, 2006 7:56 am
by wei
On a different matter, and maybe stating the obvious: multiple encodings on the same web page will result in lots of fun ;)

If you really need to use different encodings, or have inherited their use, then you need to convert to, say, UTF-8. These conversion tables are conveniently available through

http://trac.akelos.org/cgi-bin/trac.cgi ... 8_mappings

Posted: Sat Aug 26, 2006 12:01 pm
by Ambush Commander
Do you have to define the character sets that are going to be used like you do with character encoding? Guessing no, but there seems to be plenty of opportunities for you to do so.
An encoding, by necessity, defines a character set. It's a many-to-one relationship: UTF-8 and UTF-16 both map to Unicode, but there's no such thing as an encoding that maps to multiple character sets.
Do you think it is wise to make a library UTF-8 only? Why did you make that decision for yours?
No. However, I do think it's wise to do internal processing in UTF-8 and convert it to/from whatever the source encoding was. This is an issue that I am actively working to resolve in HTMLPurifier, esp. considering the prevalence of Latin-1 as the standard encoding.
I know what you mean when you say checking for well-formedness on all input, but this sounds like one hell of a lot of effort and processing. How did you manage it whilst minimising duplication and avoiding double-checks?
I have it easy because I'm just doing library stuff, so there's a single point of input, making it really easy to ensure the cleaning is done and then assume that we're dealing with good UTF-8.

There are several ways to go about it. I would presume that the time when you are validating/filtering user input from the POST and GET arrays is also the time to check character encodings and the like. Be sure not to do anything to binary data though ;-)

You could also create a string wrapper class, like how JavaScript strings are primitives and objects simultaneously. It sounds a bit clunky to me though.
Can multibyte characters be used in variables, function or class names? Should I handle this? A lot of my code is quite reflective so this could be an issue, for instance where the library user has to supply a callback should that be tested for wellformedness? I'm guessing yes and if that is the case that could be a really large problem.
PHP allows it, but it's considered extremely poor form. I always write my PHP files in ASCII, except for unit test cases, which are occasionally encoded in UTF-8 without a BOM for Unicode-specific tests (and even then, the only Unicode characters are in strings). I wouldn't bother checking.
Can I pilfer cleanUTF8()?
Of course. Be sure to attribute the author I pilfered the base code from too!
This might be an alternative actually but I like you better than Harry Fuecks even if he does have a great name.
It's much more feature rich, I agree. It depends on what you need to do.

Posted: Sat Aug 26, 2006 1:08 pm
by Ollie Saunders
No. However, I do think it's wise to do internal processing in UTF-8 and convert it to/from whatever the source encoding was. This is an issue that I am actively working to resolve in HTMLPurifier, esp. considering the prevalence of Latin-1 as the standard encoding.
Hmmm, that makes things even more complicated. I think what I am going to do is force my users into UTF-8; if (or more likely when) there is objection, I'll put in support for others, but it'll be a two-tier process that way. Besides, people should be using UTF-8; isn't it a requirement of XHTML?
I have it easy because I'm just doing library stuff, so there's a single point of input, making it real easy to ensure the cleaning is done, and then assume that we're dealing with good UTF-8.
Yeah, you do have it easy. Jammy bastard :P
There's several ways to go about doing it, I would presume that the time when you are validating/filtering user input from the POST and GET arrays
Not just that. I can't reliably ensure that any of the public properties or set methods will be safe, so I have to check them too. :(
Be sure not to do anything to binary data though
Form submissions are all strings, though; there isn't any binary data.
You could also create a string wrapper class, like how JavaScript strings are primitives and objects simultaneously. It sounds a bit clunky to me though.
And me. Nah, I think I know how I'm going to go about it now.
PHP allows it, but it's considered extremely poor form. I always write my PHP files in ASCII excepting Unit Test cases which occasionally are encoded in UTF-8 without a BOM for Unicode specific tests (and even then, the only unicode characters are in strings). I wouldn't bother checking.
OK cool.
Of course. Be sure to attribute the author I pilfered the base code from too!
Wouldn't dream of not. Oh and thanks!
It's much more feature rich, I agree. It depends on what you need to do.
Yeah, I think I am going to use this too; I'm going to OOify it first.