HTML Purifier on del.icio.us!


Oren
DevNet Resident
Posts: 1640
Joined: Fri Apr 07, 2006 5:13 am
Location: Israel

Post by Oren »

Yep, it's on the front page! http://del.icio.us/
Maugrim_The_Reaper
DevNet Master
Posts: 2704
Joined: Tue Nov 02, 2004 5:43 am
Location: Ireland

Post by Maugrim_The_Reaper »

:lol:

A Google search for "filter html php" puts it on the second page. More linkage!
A search for "filter html xss" puts it in the sweet position of 4th on the first page though ;).
Luke
The Ninja Space Mod
Posts: 6424
Joined: Fri Aug 05, 2005 1:53 pm
Location: Paradise, CA

Post by Luke »

first page for "html filter"
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

I'm curious and too lazy to inspect...

But how exactly does HTMLPurifier cleanse HTML markup?

Does it use a fixed set of rules (via regex) to strip/rip and replace bad tags? From watching AC I would suspect it uses a fairly complicated parser to carry out its magic?
patrikG
DevNet Master
Posts: 4235
Joined: Thu Aug 15, 2002 5:53 am
Location: Sussex, UK

Post by patrikG »

Hockey wrote: I'm curious and too lazy to inspect...
Guess it's time to quit that habit, no?
superdezign
DevNet Master
Posts: 4135
Joined: Sat Jan 20, 2007 11:06 pm

Post by superdezign »

Hockey wrote: I'm curious and too lazy to inspect...

But how exactly does HTMLPurifier cleanse HTML markup?

Does it use a fixed set of rules (via regex) to strip/rip and replace bad tags? From watching AC I would suspect it uses a fairly complicated parser to carry out its magic?
Download it and look at it. I downloaded it, but haven't had a chance to dissect any of it yet. My assumption is that it tokenizes the markup and matches tokens to determine proper nesting, checks tags and attributes against a list of allowed ones, and probably replaces cluttered elements with more efficient ones (if it doesn't do that last one, it should be on a wish list :P). I'd assume it would work well with WYSIWYG HTML editors like FCKeditor, turning their ugly HTML into valid HTML.
RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

Hockey wrote: I'm curious and too lazy to inspect...
Then we're too lazy to answer. Go download it. Wait three days for your laziness to subside. Wait three more days, then post another question about an application that you are too lazy to check out for yourself. :wink:

Cheers :D
superdezign
DevNet Master
Posts: 4135
Joined: Sat Jan 20, 2007 11:06 pm

Post by superdezign »

Everah wrote:Cheers :D
Hehe.
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

That's pretty cool. It basically tripled the number of del.icio.us bookmarks htmlpurifier.org has. :-D (It's tough to track del.icio.us referrals, though, because they don't happen immediately. After all, it is a bookmarking site.)

Google does weird things to my website. Before I started excluding their bot, it accounted for 70% of my site traffic (great ego boost, but not so informative). They send me the most referrals, though, so I'm not complaining. The top generic search term is "php html filter"; after that it's "embed youtube html" :lol:

Hockey, if you want to know about HTML Purifier's internals in a nutshell, it's basically:
1. Parse document into an array of tag and text tokens (Lexer)
2. Remove all elements not on whitelist and transform certain other elements
into acceptable forms (e.g. <font>)
3. Make document well formed while helpfully taking into account certain quirks,
such as the fact that <p> tags traditionally are closed by other block-level
elements.
4. Run through all nodes and check children for proper order (especially
important for tables).
5. Validate attributes according to more restrictive definitions based on the
RFCs.
6. Translate back into a string. (Generator)
...and a lot of little details.
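
For anyone who wants to see the shape of that pipeline without downloading anything, here is a rough Python sketch of steps 1, 2, 3 and 6, plus a very crude stand-in for step 5 (it only whitelists attribute names and does not validate values). It is not HTML Purifier's code: the `ALLOWED` and `TRANSFORMS` tables and all function names are made up for illustration, and step 4 (child ordering) is left out entirely.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag -> allowed attribute names (not HTML Purifier's config).
ALLOWED = {"p": set(), "b": set(), "i": set(), "a": {"href"}, "span": set()}
# Hypothetical transform table: deprecated tag -> acceptable replacement.
TRANSFORMS = {"font": "span"}


class Lexer(HTMLParser):
    """Step 1: turn markup into a flat list of tag and text tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag, {}))

    def handle_data(self, data):
        self.tokens.append(("text", data, {}))


def filter_tokens(tokens):
    """Steps 2 and 5 (crudely): transform or drop tags, keep whitelisted attributes."""
    for kind, value, attrs in tokens:
        if kind == "text":
            yield kind, value, attrs
            continue
        tag = TRANSFORMS.get(value, value)
        if tag not in ALLOWED:
            continue  # the tag itself is dropped; its text content survives
        kept = {k: v for k, v in attrs.items() if k in ALLOWED[tag] and v is not None}
        yield kind, tag, kept


def make_well_formed(tokens):
    """Step 3: drop stray end tags and close anything left open."""
    stack, out = [], []
    for kind, value, attrs in tokens:
        if kind == "start":
            stack.append(value)
            out.append((kind, value, attrs))
        elif kind == "end":
            if value not in stack:
                continue  # stray end tag with no matching start
            while stack:
                open_tag = stack.pop()
                out.append(("end", open_tag, {}))
                if open_tag == value:
                    break
        else:
            out.append((kind, value, attrs))
    while stack:
        out.append(("end", stack.pop(), {}))
    return out


def generate(tokens):
    """Step 6: serialize tokens back into a string."""
    parts = []
    for kind, value, attrs in tokens:
        if kind == "text":
            parts.append(escape(value, quote=False))
        elif kind == "start":
            attr_text = "".join(f' {k}="{escape(v)}"' for k, v in attrs.items())
            parts.append(f"<{value}{attr_text}>")
        else:
            parts.append(f"</{value}>")
    return "".join(parts)


def purify(html):
    lexer = Lexer()
    lexer.feed(html)
    lexer.close()  # flush any trailing text
    return generate(make_well_formed(filter_tokens(lexer.tokens)))
```

Calling `purify('<p onclick="x()">Hi <font color="red">there <blink>now</blink></p>')` strips the `onclick` attribute, turns `<font>` into a bare `<span>`, drops the `<blink>` tags while keeping their text, and closes the unclosed `<span>` before `</p>`. A real filter needs far more than this (URI scheme checks, hidden-element handling for things like `<script>`, and so on), so treat it as a picture of the stages only.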
superdezign
DevNet Master
Posts: 4135
Joined: Sat Jan 20, 2007 11:06 pm

Post by superdezign »

Sounds impressive. I just recently wrote a parser with a tokenizer for my blog's tags and I'd love to find a more efficient method of tokenizing (unless it's only possible through recursion... Then I'm doing fine. :P)
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Well, if you can, use PHP's DOM extension to parse possibly poorly formed HTML. It's much faster, since it's implemented natively in C. From the sound of things, however, it looks like you're parsing and making the document well formed at the same time (otherwise, recursion would not be necessary).
alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

superdezign wrote:Sounds impressive. I just recently wrote a parser with a tokenizer for my blog's tags and I'd love to find a more efficient method of tokenizing (unless it's only possible through recursion... Then I'm doing fine. :P)
Recursion is evil if what you are after is optimized code :)

AC, does your HTML Purifier use the DOM?
patrikG
DevNet Master
Posts: 4235
Joined: Thu Aug 15, 2002 5:53 am
Location: Sussex, UK

Post by patrikG »

Hockey wrote:Recursion is evil if what you are after is optimized code :)
Why?
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Ahh... then I have sinned. HTML Purifier, if it detects PHP5 and DOM, will use DOM to parse HTML. Then I traverse the DOM and translate it back into tokens (using recursion) that get processed later on (design decision I made early on). I use some reference magic to keep things zippy though, and it still beats out the pure-PHP parser every time.
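
To make the recursion question concrete, here is a small Python sketch of the "walk the DOM and flatten it back into tokens" idea, written both recursively and with an explicit stack. It uses Python's `xml.dom.minidom` as a stand-in for PHP's DOM extension (so the input has to be well-formed XML), and the function names and token tuples are invented for the example rather than taken from HTML Purifier.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def tokens_recursive(node, out):
    """Depth-first walk that emits start, text and end tokens."""
    if node.nodeType == Node.TEXT_NODE:
        out.append(("text", node.data))
    elif node.nodeType == Node.ELEMENT_NODE:
        out.append(("start", node.tagName))
        for child in node.childNodes:
            tokens_recursive(child, out)
        out.append(("end", node.tagName))
    return out


def tokens_iterative(root):
    """Same traversal, using an explicit stack instead of the call stack."""
    out = []
    stack = [("visit", root)]
    while stack:
        action, node = stack.pop()
        if action == "close":
            out.append(("end", node.tagName))
        elif node.nodeType == Node.TEXT_NODE:
            out.append(("text", node.data))
        elif node.nodeType == Node.ELEMENT_NODE:
            out.append(("start", node.tagName))
            stack.append(("close", node))
            # Push children in reverse so they pop in document order.
            for child in reversed(node.childNodes):
                stack.append(("visit", child))
    return out


doc = minidom.parseString("<div><p>Hello <b>world</b></p><p>Bye</p></div>")
root = doc.documentElement
```

Both `tokens_recursive(root, [])` and `tokens_iterative(root)` produce the same token list for this document. The recursive version is shorter and easier to read; the iterative one mainly matters when documents nest deeply enough to hit a recursion limit, which ordinary HTML rarely does.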
Oren
DevNet Resident
Posts: 1640
Joined: Fri Apr 07, 2006 5:13 am
Location: Israel

Post by Oren »

And now it's on SitePoint... http://www.sitepoint.com/blogs/2007/07/ ... ne-cometh/
(6th on the list)
Congrats Ambush Commander :wink: