Problems with Variable Width Encodings
Posted: Thu Aug 17, 2006 2:23 pm
by Ambush Commander
http://ha.ckers.org/blog/20060817/varia ... -encoding/
I haven't had time to read it in depth, but it sounds pretty scary...
Posted: Thu Aug 17, 2006 2:31 pm
by feyd
It's not too bad... but then again, I consider most things not too bad.

..
Posted: Thu Aug 17, 2006 3:26 pm
by Ambush Commander
Hmm... after reading it more, it seems that checking a document for UTF-8 well-formedness should be enough. Edit - Actually, ensuring all quotes are quoted would be a better idea.
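A well-formedness check of the kind mentioned above could be sketched as follows. The function name is_well_formed_utf8() is made up for illustration; the sketch relies on preg_match() with the /u modifier refusing to match a subject that is not valid UTF-8.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the thread).
// With the /u modifier, preg_match() returns false when the subject
// string is not valid UTF-8, so an empty pattern acts as a validator.
function is_well_formed_utf8($str) {
    return preg_match('//u', $str) === 1;
}

var_dump(is_well_formed_utf8("caf\xC3\xA9")); // valid two-byte sequence
var_dump(is_well_formed_utf8("\xE2\x82"));    // truncated three-byte sequence
```

This only reports whether the input is well-formed; it does not repair or escape anything.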
Posted: Sat Sep 23, 2006 8:55 pm
by shiflett
Here's an earlier blog post I made about this issue:
http://shiflett.org/archive/178
I think it's a simple, clear example. Hope it helps.
Posted: Sat Sep 23, 2006 9:00 pm
by Ambush Commander
htmlentities() is fundamentally flawed, though for a different reason: it doesn't handle control characters, such as a null byte:
Code:
<?php echo strlen(htmlentities("\0")); // prints 1: the null byte passes through unescaped ?>
My policy is to run things through a specialized escape() function that first calls an encoding checker to make the string well-formed and remove non-SGML code points, and then calls htmlspecialchars() (with the proper character encoding passed, of course).
Posted: Sun Sep 24, 2006 1:51 am
by matthijs
Ambush, care to show us your escape function and explain it a bit more?
And about the problems with htmlentities: I assume this is only a problem when the server somehow doesn't send the documents as UTF-8? That's the default, isn't it?
Posted: Sun Sep 24, 2006 12:19 pm
by Ambush Commander
The condensed version (UTF-8 only, and it requires iconv to be installed) is this:
Code:
// Encode a Unicode code point as a UTF-8 byte string.
function unichr($code) {
    if ($code > 1114111 or $code < 0 or
        ($code >= 55296 and $code <= 57343)) {
        // bits are set outside the "valid" range as defined
        // by UNICODE 4.1.0 (this also rejects UTF-16 surrogates)
        return '';
    }
    // $x is the last byte; $y, $z and $w are the bytes before it
    $x = $y = $z = $w = 0;
    if ($code < 128) {
        // regular ASCII character
        $x = $code;
    } else {
        // set up bits for UTF-8
        $x = ($code & 63) | 128;
        if ($code < 2048) {
            // two-byte sequence: $y is the lead byte
            $y = (($code & 2047) >> 6) | 192;
        } else {
            $y = (($code & 4032) >> 6) | 128;
            if ($code < 65536) {
                // three-byte sequence: $z is the lead byte
                $z = (($code >> 12) & 15) | 224;
            } else {
                // four-byte sequence: $w is the lead byte
                $z = (($code >> 12) & 63) | 128;
                $w = (($code >> 18) & 7) | 240;
            }
        }
    }
    // set up the actual character
    $ret = '';
    if ($w) $ret .= chr($w);
    if ($z) $ret .= chr($z);
    if ($y) $ret .= chr($y);
    $ret .= chr($x);
    return $ret;
}

function escape($str) {
    // Built once, then reused: maps each non-SGML character to ''
    static $non_sgml_chars = array();
    if (empty($non_sgml_chars)) {
        for ($i = 0; $i <= 31; $i++) {
            // non-SGML ASCII chars
            // save \r, \t and \n
            if ($i == 9 || $i == 13 || $i == 10) continue;
            $non_sgml_chars[chr($i)] = '';
        }
        // DEL and the C1 control range (U+007F to U+009F)
        for ($i = 127; $i <= 159; $i++) {
            $non_sgml_chars[unichr($i)] = '';
        }
    }
    // Drop malformed byte sequences. Some iconv builds return false
    // here instead of honouring //IGNORE, so fall back to an empty string.
    $str = @iconv('UTF-8', 'UTF-8//IGNORE', $str);
    if ($str === false) {
        $str = '';
    }
    $str = strtr($str, $non_sgml_chars);
    return htmlspecialchars($str, ENT_COMPAT, 'UTF-8');
}
I've also got an implementation that works when iconv is not installed. And no, htmlentities just doesn't work. Period.
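The iconv-free implementation mentioned above is not shown in the thread. As a rough sketch only, and not that implementation, one way to get similar behaviour is to use PCRE's /u modifier for both the validity check and the control-character stripping. The name escape_no_iconv() and the decision to reject malformed input outright (rather than repair it) are assumptions made for this example.

```php
<?php
// Hypothetical iconv-free variant; NOT the implementation from the post.
function escape_no_iconv($str) {
    // preg_match() with /u returns false when the subject is not valid UTF-8.
    if (preg_match('//u', $str) !== 1) {
        // Assumption: malformed input is rejected rather than repaired.
        return '';
    }
    // Strip C0 controls (keeping \t, \n and \r), DEL, and the C1 range
    // U+0080 to U+009F, i.e. the same characters the loops above remove.
    $str = preg_replace(
        '/[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{7F}-\x{9F}]/u',
        '',
        $str
    );
    return htmlspecialchars($str, ENT_COMPAT, 'UTF-8');
}

echo escape_no_iconv("a\0b <b>\"hi\"</b>"), "\n";
```

Unlike the iconv version, this throws away the whole string when any byte sequence is malformed, which is stricter but simpler to reason about.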
Posted: Sun Sep 24, 2006 3:10 pm
by matthijs
Thanks for showing, Ambush. Will study that.
Ambush Commander wrote: And no, htmlentities just doesn't work. Period.
That's quite a bold statement, isn't it? Why does all the security advice say to use it, then? I do remember reading material (for example a book by Chris) in which htmlentities is used to prevent XSS etc. Are you saying that every site which uses htmlentities to escape output to HTML is still vulnerable to XSS or other kinds of attacks?
That's quite something ...
Posted: Sun Sep 24, 2006 3:15 pm
by Ambush Commander
matthijs wrote: That's quite a bold statement, isn't it? Why does all the security advice say to use it, then? I do remember reading material (for example a book by Chris) in which htmlentities is used to prevent XSS etc. Are you saying that every site which uses htmlentities to escape output to HTML is still vulnerable to XSS or other kinds of attacks?
That's quite something ...
It is quite bold. Although I am not saying that any site that uses htmlentities() is suddenly vulnerable to XSS, there are two ramifications of using a bare-naked htmlentities():
1. User input can very easily break the validation of pages. No matter how well-constructed the rest of your layout is, null bytes aren't treated very kindly. Even worse is if the string is malformed: the validator may refuse to check your page at all. While browsers are quite forgiving, this is not the case with, say, XML readers.
2. Under certain conditions (for example, those in the post linked above), especially when user input is put into the attributes of HTML tags, XSS is enabled.
Since htmlentities() claims to "Convert all applicable characters to HTML entities", with the implicit assumption that anything passed through htmlentities() is safe to output, I would say yes: htmlentities() is fundamentally broken.
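One concrete case of the attribute problem, offered as an illustration rather than as the scenario from the linked posts: with the ENT_COMPAT flag (the default before PHP 8.1), htmlentities() leaves single quotes alone, so a value placed inside a single-quoted attribute can break out of it. The input string and the markup below are made up for the example.

```php
<?php
// Made-up attacker input for illustration.
$input = "x' onmouseover='alert(1)";

// ENT_COMPAT converts double quotes only; single quotes pass through.
$compat = htmlentities($input, ENT_COMPAT, 'UTF-8');
// ENT_QUOTES converts both kinds of quotes.
$quotes = htmlentities($input, ENT_QUOTES, 'UTF-8');

// In a single-quoted attribute, the first line lets the input close the
// attribute and inject an event handler; the second does not.
echo "<a title='$compat'>link</a>\n";
echo "<a title='$quotes'>link</a>\n";
```

Passing ENT_QUOTES and always quoting attribute values covers this particular case; it does not address the malformed-encoding and control-character issues discussed above.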