Holes in preg_match filters

Discussions of secure PHP coding. Security in software is important, so don't be afraid to ask. And when answering: be anal. Nitpick. No security vulnerability is too small.

Moderator: General Moderators

User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

I can't take credit for that typecasting quirk: it's been bouncing around ha.ckers.org for some time. A few more gems:

Code: Select all

$id = $content = 'default';
if(isset($_GET['id'])) {
    $id = stripslashes($_GET['id']);
}
if(isset($_GET['content'])) {
    $content = stripslashes($_GET['content']);
}
if(!eregi('^[a-z[]]*$',$id) || !eregi('^[a-z[]]*$',$content)) {
    $id = $content = 'default';
}
Is vulnerable because the ereg functions are not binary safe, and thus a null byte will cause the ereg engine to ignore anything coming after. The user-education bit that comes along with this is, quite simply, don't use the ereg functions!

Code: Select all

$id = $content = 'default';
if(isset($_GET['id'])) {
    $id = stripslashes($_GET['id']);
}
if(isset($_GET['content'])) {
    $content = stripslashes($_GET['content']);
}
<p id="<?php echo htmlspecialchars($id,ENT_QUOTES); ?>"><?php echo strip_tags($content); ?></p>
Which is vulnerable in Internet Explorer 6 due to character encoding issues: try passing ?id=%EF&content=%22%20onmouseover=%22alert('xss')%22>asdf to the script and you'll see what I mean. The moral of the story here is: 1. Don't allow unescaped quotes in your text, 2. Don't use strip_tags and 3. Check all your strings for UTF-8 well-formedness!

While I love PHP and believe it's absolutely awesome, these are some cases where certain architectural designs have resulted in seemingly innocuous bits of code being security risks. I was surprised by all three of these problems when I first heard about them, as well as when I heard about this new one by Esser. I believe it's important that we not write these problems off as "real, but difficult to exploit." Instead, we should file them away, and provide them as compelling reasons why newbies should adopt what we call "best practices".
User avatar
shiflett
Forum Contributor
Posts: 124
Joined: Sun Feb 06, 2005 11:22 am

Post by shiflett »

Regarding the IE bug with UTF-8, every example exploit I've seen is too contrived for me to appreciate the problem, but I know several experts who consider it to be one of the biggest discoveries of 2006.

Can you provide an example exploit for something as simple as this?

Code: Select all

header('Content-Type: text/html; charset=UTF-8');
echo htmlentities($_GET['foo'], ENT_QUOTES, 'UTF-8');
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

No. The multibyte character issue requires that there already are tags on the page, and that some user-input is being put into the attributes of the tags.

Some extra research:

The addition of the UTF-8 parameter as the character encoding has some interesting effects, that I was not aware of before. In the end, both functions are still vulnerable, but to a narrower range of bytes. Here were my results:

For htmlentities:
* Characters 254 and 255 had entities, so were completely invulnerable
* Characters 192-253 did not, but had a null byte appended after them when they were at the end of the string, i.e.

Code: Select all

echo urlencode(htmlentities(chr(196), ENT_QUOTES, 'UTF-8'));
Outputs: %C4%00

For htmlspecialchars:
* Characters 254 and 255 did not have any extra processing to them
* Characters 192-253 also had the null byte appended to them when they appeared at the end of the string

Therefore, htmlspecialchars was already vulnerable.

After a bit more testing, I realized that for certain characters, Internet Explorer 6 would still glob up the quote anyway. These were 234--255 (although 254 and 255 were effectively useless as they were entity-ized by htmlentities). So htmlentities is still vulnerable. After some thinking, I think the reason why this is the case is because those bytes are the three-byte character sequence openers, so Internet Explorer gobbles up the next two characters unconditionally.

The best set of bytes to test with, then are 234--253, which will work in all cases.

As a follow-up exercise, I would like to check out the underlying C code for these two functions and see what precisely is going on. What I find really interesting is that:

Code: Select all

echo urlencode(htmlentities(chr(253).'a', ENT_QUOTES, 'UTF-8'));
..outputs %FDa (no null byte). This is probably because the interpreter does a dumb check at the end to make sure that the character sequence is well-formed, and adds a null byte if isn't.
Post Reply