HTMLPurifier - Take your best shot
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
In terms of XSS: in the live demo there is barely any security threat, because the HTML you post is not shown to anyone else. But the code that powers the demo can be used for other things that do involve other viewers, so any JavaScript (or other baddies) that gets through is a security problem.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Quote: does not output (or accept?) Russian characters properly
That's extremely strange, because I tested it with Chinese characters a while back and it worked properly (that's not working now either). But you're correct. I'll have to see where the encoding goes wrong.
Edit - After further investigation, UTF-8 works in PHP 4 but not in PHP 5. Hrmm...
Edit 2 - I know why: the DOM extension is not UTF-8 safe without more configuration! This'll be easy to fix.
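To illustrate the kind of failure described in these edits, here is a minimal sketch in Python (not PHP, and not HTMLPurifier's actual code). It assumes the parser falls back to a single-byte charset when it is not told the input is UTF-8; that fallback is an assumption made for illustration, and `decode_with` is a hypothetical helper name.

```python
def decode_with(raw: bytes, encoding: str) -> str:
    """Decode raw bytes the way a parser configured for `encoding` would."""
    return raw.decode(encoding)


text = "Привет"             # six Cyrillic characters
raw = text.encode("utf-8")  # twelve bytes, two per character

# Assumption: an unconfigured parser treats each byte as one Latin-1 character.
wrong = decode_with(raw, "iso-8859-1")
# A parser that is told the real charset round-trips the text.
right = decode_with(raw, "utf-8")

print(len(text), len(wrong), right == text)  # prints: 6 12 True
```

The garbled result has twice as many characters as the input, which matches the usual symptom of UTF-8 being read as a single-byte encoding.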
Quote: I think it should either 'textify' as is or eradicate altogether not allowed tags. At the moment it looks like it first purifies it and then textifies. Take for example <iframe>
Yep. This is due to the design of HTMLPurifier, where parsing the HTML happens first and can cause information to be lost. Eradication would be the simplest way; smart textification would require more coding.
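The two behaviours being discussed (removing a disallowed tag versus turning it into visible text) can be sketched roughly as follows. This is a Python illustration built on the standard `html.parser` module, not HTMLPurifier's implementation; the `ALLOWED_TAGS` set, the `TagFilter` class, and the `filter_html` function are all hypothetical names. For simplicity it drops every attribute on allowed tags, and it textifies from the original tag text rather than from a re-serialized parse tree.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "div"}  # hypothetical whitelist


class TagFilter(HTMLParser):
    """Keep allowed tags; remove or textify everything else."""

    def __init__(self, mode: str):
        super().__init__(convert_charrefs=True)
        self.mode = mode  # "remove" or "textify"
        self.out = []

    def _disallowed(self, raw: str) -> None:
        # "textify": show the tag as escaped text; "remove": drop it silently.
        if self.mode == "textify":
            self.out.append(escape(raw))

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes are dropped in this sketch
        else:
            self._disallowed(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")
        else:
            self._disallowed(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def filter_html(markup: str, mode: str) -> str:
    parser = TagFilter(mode)
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


sample = '<b>hi</b><iframe src="x"></iframe>'
print(filter_html(sample, "remove"))
print(filter_html(sample, "textify"))
```

In "remove" mode only the tag itself disappears; any text inside a removed element is still emitted as escaped text, which may or may not be the behaviour one wants.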
Quote: chokes on incomplete attributes
That's interesting, but I know precisely why it's happening. I don't think I'm going to bother fixing it (besides a trivial check or two). This is because when running PHP 5, the library uses the DOM extension to parse the text, which means wrapping the text in <html> and <body>. So then your code looks like: "<html><body><div><img src="</div></body></html>" Maybe I can do without those.
- feyd
- Neighborhood Spidermoddy
- Posts: 31559
- Joined: Mon Mar 29, 2004 3:24 pm
- Location: Bothell, Washington, USA
Volka's posted HTML code here generates some interesting output
Code:
xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <title>xyz</title><form method="post" action="whatever1">
<input type="text" name="username" /><input type="text" name="password" /><input type="submit" />
</form> <form method="post" action="whatever2">
<input type="text" name="username" /><input type="text" name="password" /><input type="submit" />
</form>
My own basic purifier (that was only concerned with removing attributes and tags that weren't wanted) might be of interest too. This was built some time ago though, without any updates since.
http://code.tatzu.net/cleantags/
With default settings, cleantags outputs
Code:
xyz
<div>
<input type="text" name="username">
<input type="text" name="password">
<input type="submit">
</div>
<div>
<input type="text" name="username">
<input type="text" name="password">
<input type="submit">
</div>
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
feyd wrote: Volka's posted html code here generates some interesting output
This is a symptom of a core problem/missing feature, namely, the ability to recognize that input is a well-formed document rather than a fragment and then parse it accordingly. I'll put this on high priority.
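One way the document-versus-fragment check could look is sketched below in Python. It is a rough heuristic for illustration only, not how HTMLPurifier does or will do it; the function names `looks_like_document` and `body_fragment` and both regular expressions are assumptions.

```python
import re

# Heuristic: optional XML declaration, optional DOCTYPE, then an <html> tag.
_DOC_RE = re.compile(
    r"^\s*(?:<\?xml[^>]*\?>\s*)?(?:<!DOCTYPE[^>]*>\s*)?<html[\s>]",
    re.IGNORECASE,
)
_BODY_RE = re.compile(
    r"<body(?:\s[^>]*)?>(.*?)</body\s*>",
    re.IGNORECASE | re.DOTALL,
)


def looks_like_document(markup: str) -> bool:
    """Return True if the input appears to be a full HTML document."""
    return bool(_DOC_RE.match(markup))


def body_fragment(markup: str) -> str:
    """Return the part of the input that should be filtered as a fragment."""
    if not looks_like_document(markup):
        return markup  # already a fragment
    match = _BODY_RE.search(markup)
    # If no <body> is found, hand the input on unchanged and let the
    # downstream filter deal with it.
    return match.group(1) if match else markup


doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">'
    "<head><title>xyz</title></head><body><p>hi</p></body></html>"
)
print(body_fragment(doc))
print(body_fragment("<p>hi</p>"))
```

A real implementation would more likely make this decision inside the parser than with regular expressions; the sketch only shows where the decision sits in the pipeline.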
feyd wrote: My own basic purifier (that was only concerned with removing attributes and tags that weren't wanted) might be of interest too.
While I'd rather not criticize Feyd, this is precisely the type of fundamentally flawed filter I wanted to replace with this library. Blacklisting simply does not work (in terms of protecting against XSS). However, the behavior seems quite intuitive, so I'll try to model the default behavior after that.
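To make the blacklist-versus-whitelist point concrete, here is a small Python sketch. Neither filter is cleantags or HTMLPurifier; the tag lists, `blacklist_filter`, `WhitelistFilter`, and `whitelist_filter` are all hypothetical and deliberately tiny. The whitelist version also does not validate URL schemes such as `javascript:`, which a real filter would need to do.

```python
import re
from html import escape
from html.parser import HTMLParser

BLACKLISTED_TAGS = ("script", "iframe")  # hypothetical blacklist
ALLOWED = {"img": {"src", "alt"}, "b": set(), "p": set()}  # tag -> allowed attributes
VOID_TAGS = {"img"}  # elements that take no end tag


def blacklist_filter(markup: str) -> str:
    """Strip only the tags that are explicitly listed as bad."""
    pattern = r"</?(?:%s)\b[^>]*>" % "|".join(BLACKLISTED_TAGS)
    return re.sub(pattern, "", markup, flags=re.IGNORECASE)


class WhitelistFilter(HTMLParser):
    """Emit only tags and attributes that are explicitly listed as good."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return
        kept = "".join(
            ' %s="%s"' % (name, escape(value or ""))
            for name, value in attrs
            if name in ALLOWED[tag]
        )
        self.out.append("<%s%s>" % (tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED and tag not in VOID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def whitelist_filter(markup: str) -> str:
    parser = WhitelistFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


payload = '<img src="x.png" onerror="alert(1)">'
print(blacklist_filter(payload))  # the onerror handler is still there
print(whitelist_filter(payload))  # only src survives
```

The blacklist misses the payload because nobody thought to list `onerror`; the whitelist drops it because nobody listed it as allowed.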
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
feyd wrote: for most things, I think removing them altogether is preferred over escaping. As long as it's easily switched, it's all good.
Agreed.
- Ollie Saunders
- DevNet Master
- Posts: 3179
- Joined: Tue May 24, 2005 6:01 pm
- Location: UK
OK, excuse my ignorance, but what is the intended audience, purpose and expected application of HTMLPurifier?
It does seem amazing, and it's astonishing what you have achieved in such a short time, but why do I need HTMLPurifier?
Edit: Oh, it would be nice if it indented the HTML properly for you and removed all other whitespace.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Ollie Saunders wrote: OK excuse my ignorance but what is the intended audience, purpose and expected application of HTMLPurifier? It does seem amazing, and it's astonishing what you have achieved in such a short time, but why do I need HTMLPurifier?
To ensure user-submitted HTML is safe for output, both in terms of XSS and validation. Heck, you could even send trusted content through it just to make sure the page validates.
Ollie Saunders wrote: Oh it would be nice if it indented the HTML properly for you and removed all other whitespace.
That's a feature that I've been thinking about. I'll probably have it done before the stable release.
