Problems with Variable Width Encodings

Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

http://ha.ckers.org/blog/20060817/varia ... -encoding/

I haven't had time to read it in depth, but it sounds pretty scary...
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

It's not too bad... but then again, I consider most things not too bad. :)
Ambush Commander

Post by Ambush Commander »

Hmm... after reading it more, it seems that checking a document for UTF-8 well-formedness should be enough. Edit: Actually, ensuring all quotes are escaped would be a better idea.
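As a rough illustration of the well-formedness check mentioned above, here is a minimal sketch. The helper name is_wellformed_utf8() is hypothetical and does not come from the thread; it relies on PCRE's /u modifier, which makes preg_match() fail when the subject is not valid UTF-8.

```php
<?php
// Hypothetical helper (not from the thread): true when $str is well-formed UTF-8.
// With the /u modifier PCRE validates the subject string as UTF-8 first;
// preg_match() returns false on invalid input and 1 when the empty pattern matches.
function is_wellformed_utf8($str) {
    return preg_match('//u', $str) === 1;
}

var_dump(is_wellformed_utf8("caf\xC3\xA9")); // "café" encoded as UTF-8
var_dump(is_wellformed_utf8("caf\xE9"));     // lone Latin-1 byte
var_dump(is_wellformed_utf8("\xC0\xAF"));    // overlong encoding of "/"
```

A check like this only reports whether the input is well-formed; deciding whether to reject or repair a bad string is a separate choice.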
shiflett
Forum Contributor
Posts: 124
Joined: Sun Feb 06, 2005 11:22 am

Post by shiflett »

Here's an earlier blog post I made about this issue:

http://shiflett.org/archive/178

I think it's a simple, clear example. Hope it helps.
Ambush Commander

Post by Ambush Commander »

htmlentities() is fundamentally flawed, though for a different reason: it doesn't handle control characters, such as a null byte:

Code:

<?php echo strlen(htmlentities("\0")); // prints 1: the null byte passes through untouched ?>
My policy is to run everything through a specialized escape() function that first calls an encoding checker to make the string well-formed and remove non-SGML code points, and then calls htmlspecialchars() (with the proper character encoding passed, of course).
matthijs
DevNet Master
Posts: 3360
Joined: Thu Oct 06, 2005 3:57 pm

Post by matthijs »

Ambush, care to show us or explain more about your escape function?

And about the problems with htmlentities: I assume they only arise when the server somehow doesn't send the documents as UTF-8? That's the default, isn't it?
Ambush Commander

Post by Ambush Commander »

The condensed version (UTF-8 only, and it requires iconv to be installed) is thus:

Code:

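// Returns the UTF-8 byte sequence for a Unicode code point,
// or '' if the code point is not valid.
function unichr($code) {
    if($code > 1114111 or $code < 0 or
      ($code >= 55296 and $code <= 57343) ) {
        // outside 0..0x10FFFF, or a surrogate (0xD800..0xDFFF):
        // not a "valid" code point as defined by Unicode 4.1.0
        return '';
    }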
    
    $x = $y = $z = $w = 0; 
    if ($code < 128) {
        // regular ASCII character
        $x = $code;
    } else {
        // set up bits for UTF-8
        $x = ($code & 63) | 128;
        if ($code < 2048) {
            $y = (($code & 2047) >> 6) | 192;
        } else {
            $y = (($code & 4032) >> 6) | 128;
            if($code < 65536) {
                $z = (($code >> 12) & 15) | 224;
            } else {
                $z = (($code >> 12) & 63) | 128;
                $w = (($code >> 18) & 7)  | 240;
            }
        } 
    }
    // set up the actual character
    $ret = '';
    if($w) $ret .= chr($w);
    if($z) $ret .= chr($z);
    if($y) $ret .= chr($y);
    $ret .= chr($x); 
    
    return $ret;
}

function escape($str) {
    static $non_sgml_chars = array();
    if (empty($non_sgml_chars)) {
        for ($i = 0; $i <= 31; $i++) {
            // non-SGML ASCII chars
            // save \r, \t and \n
            if ($i == 9 || $i == 13 || $i == 10) continue;
            $non_sgml_chars[chr($i)] = '';
        }
        for ($i = 127; $i <= 159; $i++) {
            $non_sgml_chars[unichr($i)] = '';
        }
    }
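    // drop byte sequences that are not valid UTF-8; iconv() returns false
    // on failure, so fall back to an empty string in that case
    $str = @iconv('UTF-8', 'UTF-8//IGNORE', $str);
    if ($str === false) $str = '';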
    $str = strtr($str, $non_sgml_chars);
    return htmlspecialchars($str, ENT_COMPAT, 'UTF-8');
}
I've also got an implementation that works when iconv is not installed. And no, htmlentities just doesn't work. Period.
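The iconv-free implementation mentioned above is not shown in the thread. Purely as an assumption about what such a fallback could look like, the sketch below keeps only byte sequences that form well-formed UTF-8 and drops the rest. The function name strip_invalid_utf8() and the approach are illustrative and are not the poster's code.

```php
<?php
// Hypothetical iconv-free fallback (an assumption, not the poster's code):
// collect every well-formed UTF-8 sequence and silently discard other bytes.
function strip_invalid_utf8($str) {
    $pattern = '/
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # two bytes, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]         # three bytes, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # three bytes
        | \xED[\x80-\x9F][\x80-\xBF]         # three bytes, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # four bytes, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}          # four bytes
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # four bytes, up to U+10FFFF
    /x';
    preg_match_all($pattern, $str, $matches);
    return implode('', $matches[0]);
}

var_dump(strip_invalid_utf8("a\xE9b") === "ab");                   // stray byte removed
var_dump(strip_invalid_utf8("caf\xC3\xA9") === "caf\xC3\xA9");     // valid input untouched
```

The pattern runs in byte mode (no /u modifier) on purpose, since /u would make the whole match fail on malformed input. It builds an array of every matched sequence, so it is likely to be slower and more memory-hungry than iconv on long strings.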
matthijs

Post by matthijs »

Thanks for showing that, Ambush. I'll study it.
Ambush Commander wrote:And no, htmlentities just doesn't work. Period
That's quite a bold statement, isn't it? Why does every piece of security advice say to use it, then? I remember reading material (a book by Chris, for example) in which htmlentities is used to prevent XSS and the like. Are you saying that every site that uses htmlentities to escape output to HTML is still vulnerable to XSS or other kinds of attacks?

That's quite something ...
Ambush Commander

Post by Ambush Commander »

matthijs wrote:That's quite a bold statement, isn't it? Why does every piece of security advice say to use it, then? I remember reading material (a book by Chris, for example) in which htmlentities is used to prevent XSS and the like. Are you saying that every site that uses htmlentities to escape output to HTML is still vulnerable to XSS or other kinds of attacks?

That's quite something ...
It is quite bold. I am not saying that every site that uses htmlentities() is suddenly vulnerable to XSS, but there are two ramifications of using a bare htmlentities():

1. User input can very easily break the validation of pages. No matter how well constructed the rest of your layout is, null bytes aren't treated kindly. Worse, if the string is malformed, the validator may refuse to check your page at all. Browsers are quite forgiving, but XML readers, for example, are not.
2. Under certain conditions (for example, those described in the post linked above), especially when user input is put into the attributes of HTML tags, XSS becomes possible.
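To make the second point concrete, here is a small sketch of one such condition, chosen for illustration and not taken from the linked post: a single-quoted attribute combined with the ENT_COMPAT flag, which was the default before PHP 8.1 and converts double quotes only. The input string and markup are made up for the example.

```php
<?php
// Made-up attacker input destined for a single-quoted attribute.
$input = "x' onmouseover='alert(1)";

// ENT_COMPAT converts double quotes but leaves single quotes alone,
// so the first quote in $input closes the title attribute early.
$unsafe = htmlentities($input, ENT_COMPAT, 'UTF-8');
echo "<a title='$unsafe'>link</a>\n";

// ENT_QUOTES converts single quotes as well.
$safe = htmlentities($input, ENT_QUOTES, 'UTF-8');
echo "<a title='$safe'>link</a>\n";
```

In the first line of output the onmouseover handler ends up as a separate attribute. In the second, the quotes are emitted as character references and stay inside the title value. This covers only the quoting issue; the variable-width encoding problem discussed in the linked posts is a separate matter.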

Since htmlentities() claims to "Convert all applicable characters to HTML entities", which carries the implicit assumption that anything passed through htmlentities() is safe to output, I would say yes: htmlentities() is fundamentally broken.