HTMLPurifier - Take your best shot

Discussions of secure PHP coding. Security in software is important, so don't be afraid to ask. And when answering: be anal. Nitpick. No security vulnerability is too small.

Moderator: General Moderators

User avatar
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

What do you mean by "test for security"?
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

In terms of XSS. In the live demo, there really is barely any security threat because the HTML you post is not shown to anyone else. But the code that powers the demo can be used for other things that do involve other viewers, so any Javascript (or other baddies) that get through are security problems.
User avatar
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

does not output (or accept?) russian characters properly:
Image
User avatar
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

I think it should either 'textify' as is or eradicate altogether not allowed tags. At the moment it looks like it first purifies it and then textifies. Take for example <iframe>
User avatar
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

chokes on incomplete attributes:
<img src="
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

does not output (or accept?) russian characters properly:
That's extremely strange! Because I tested it with Chinese characters a while back and it worked properly (that's not working now either). But you're correct. I'll have to see where the encoding goes wrong.

Edit - After further investigation, UTF-8 works in PHP 4 but not in PHP 5. Hrmm...

Edit 2 - I know why: the DOM extension is not UTF-8 safe without more configuration! This'll be easy to fix.
I think it should either 'textify' as is or eradicate altogether not allowed tags. At the moment it looks like it first purifies it and then textifies. Take for example <iframe>
Yep. This is due to the design of HTMLPurifier where parsing the HTML happens first and could cause information to be lost. Eradication would be the simplest way, smart textification would require more coding.
chokes on incomplete attributes:
That's interesting, but I know precisely why it's happening. I don't think I'm going to bother fixing it (besides a trivial check or two). This is because when running PHP 5, the extension uses DOM to parse the text, which means wrapping the text in <html> and <body>. So then your code looks like: "<html><body><div><img src="</div></body></html>" Maybe I can do without those.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

International characters now work in PHP 5.
User avatar
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Volka's posted html code here generates some interesting output

Code: Select all

xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">  <title>xyz</title><form method="post" action="whatever1">
<input type="text" name="username" /><input type="text" name="password" /><input type="submit" />
</form> <form method="post" action="whatever2">
<input type="text" name="username" /><input type="text" name="password" /><input type="submit" />
</form>
My own basic purifier (that was only concerned with removing attributes and tags that weren't wanted) might be of interest too. This was built some time ago though, without any updates since.

http://code.tatzu.net/cleantags/

With default settings, cleantags outputs

Code: Select all



   
      xyz
   
   
      
         <div>
            <input type="text" name="username">
            <input type="text" name="password">
            <input type="submit">
         </div>
      
      
         <div>
            <input type="text" name="username">
            <input type="text" name="password">
            <input type="submit">
         </div>
      
   
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Volka's posted html code here generates some interesting output
This is a symptom of a core problem/missing feature, namely, the ability to recognize that input is a well-formed document rather than a fragment and then parse it accordingly. I'll put this on high priority.
My own basic purifier (that was only concerned with removing attributes and tags that weren't wanted) might be of interest too.
While I'd rather not criticize Feyd, this is precisely the type of fundamentally flawed filter I wanted to replace with this library. Blacklist just simply does not work (in terms of protecting against XSS). However, the behavior seems quite intuitive, so I'll try to model default behavior after that.
User avatar
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

I didn't write it to protect against XSS. Just remove unwanted tags, attributes and attribute values. So no worries about hurting my feelings.

It does filter out several XSS attempts however. :)
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

It does filter out several XSS attempts however.
Yep. The most common ones are caughtt.

Hmm... in terms of removing versus escaping bad tags, which should be default?
User avatar
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

for most things, I think removing them altogether is preferred over escaping. As long as it's easily switched, it's all good. :)
User avatar
s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Post by s.dot »

feyd wrote:for most things, I think removing them altogether is preferred over escaping. As long as it's easily switched, it's all good. :)
Agreed.
Set Search Time - A google chrome extension. When you search only results from the past year (or set time period) are displayed. Helps tremendously when using new technologies to avoid outdated results.
User avatar
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

OK excuse my ignorance but what is the intended audience, purpose and expected application of HTMLPurifier?
It does seem amazing, and its astonishing what you have achieved in such a short time, but why do I need HTMLPurifier?

Edit: Oh it would be nice if it indented the HTML properly for you and removed all other whitespace.
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

OK excuse my ignorance but what is the intended audience, purpose and expected application of HTMLPurifier?
It does seem amazing, and its astonishing what you have achieved in such a short time, but why do I need HTMLPurifier?
In order to insure user-submitted HTML is safe for output, both in terms of XSS and Validation. Heck, you could even send trusted content through it just to make sure the page validates.
Oh it would be nice if it indented the HTML properly for you and removed all other whitespace.
That's a feature that I've been thinking about. I'll probably have it done before the stable release.
Post Reply