HTML Purifier on del.icio.us!
Posted: Thu Jun 28, 2007 11:01 am
by Oren
Yep, it's on the front page!
http://del.icio.us/
Posted: Thu Jun 28, 2007 11:42 am
by Maugrim_The_Reaper
A Google search for "filter html php" puts it on the second page. More linkage!
A search for "filter html xss" puts it in the sweet position of 4th on the first page, though.
Posted: Thu Jun 28, 2007 12:33 pm
by Luke
first page for "html filter"
Posted: Thu Jun 28, 2007 8:38 pm
by alex.barylski
I'm curious and too lazy to inspect...
But how exactly does HTMLPurifier cleanse HTML markup?
Does it use a fixed set of rules (via regex) to strip/rip and replace bad tags? From watching AC I would suspect it uses a fairly complicated parser to carry out its magic?
Posted: Fri Jun 29, 2007 5:38 am
by patrikG
Hockey wrote: I'm curious and too lazy to inspect...
Guess it's time to quit that habit, no?
Posted: Fri Jun 29, 2007 10:13 am
by superdezign
Hockey wrote: I'm curious and too lazy to inspect...
But how exactly does HTMLPurifier cleanse HTML markup?
Does it use a fixed set of rules (via regex) to strip/rip and replace bad tags? From watching AC I would suspect it uses a fairly complicated parser to carry out its magic?
Download it and look at it. I downloaded it, but haven't had a chance to dissect any of it yet. My assumption is that it tokenizes the markup and matches tokens to determine proper nesting, checks tags and attributes against a list of allowed tags and attributes, and probably replaces cluttered elements with more efficient ones (as for the last one, if it doesn't, then it should be on a wish list). I'd assume it would work well with WYSIWYG HTML editors like FCKeditor, turning their ugly HTML into valid HTML.
Posted: Fri Jun 29, 2007 11:24 am
by RobertGonzalez
Hockey wrote: I'm curious and too lazy to inspect...
Then we're too lazy to answer. Go download it. Wait three days for your laziness to subside. Wait three more days, then post another question about an application that you are too lazy to check out for yourself.
Cheers

Posted: Fri Jun 29, 2007 11:58 am
by superdezign
Everah wrote: Cheers

Hehe.
Posted: Fri Jun 29, 2007 6:38 pm
by Ambush Commander
That's pretty cool. It basically tripled the amount of del.icio.us bookmarks htmlpurifier.org has.

(It's tough to track del.icio.us referrals, though, because they don't happen immediately. After all, it is a bookmarking site.)
Google does weird things to my website. Before I started excluding their bot from my stats, it accounted for 70% of my site traffic (great ego boost, but not so informative). They send me the most referrals, though, so I'm not complaining. The top generic search term is "php html filter"; after that it's "embed youtube html".
Hockey, if you want to know about HTML Purifier's internals in a nutshell, it's basically:
1. Parse document into an array of tag and text tokens (Lexer)
2. Remove all elements not on whitelist and transform certain other elements into acceptable forms (e.g. <font>)
3. Make document well formed while helpfully taking into account certain quirks, such as the fact that <p> tags traditionally are closed by other block-level elements.
4. Run through all nodes and check children for proper order (especially important for tables).
5. Validate attributes according to more restrictive definitions based on the RFCs.
6. Translate back into a string. (Generator)
...and a lot of little details.
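A rough sketch of how steps 1, 2, 3, 5 and 6 above might fit together, written in Python rather than PHP so it can be run standalone. It is not HTML Purifier's code: the ALLOWED whitelist, the TRANSFORMS table and the purify function are hypothetical names made up for illustration, step 4 (child ordering) is left out, and attribute values are not validated, so this is not a safe sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist (tag -> allowed attributes); not HTML Purifier's real config.
ALLOWED = {"p": set(), "b": set(), "i": set(), "a": {"href"}, "span": {"style"}}
# Hypothetical transform: rewrite the deprecated <font> element as <span>.
TRANSFORMS = {"font": "span"}


class Lexer(HTMLParser):
    """Step 1: turn markup into a flat list of tag and text tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag, {}))

    def handle_data(self, data):
        self.tokens.append(("text", data, {}))


def purify(html):
    lexer = Lexer()
    lexer.feed(html)
    lexer.close()  # flushes any buffered trailing text

    out, open_tags = [], []
    for kind, value, attrs in lexer.tokens:
        if kind == "text":
            out.append(escape(value, quote=False))
            continue
        tag = TRANSFORMS.get(value, value)
        if tag not in ALLOWED:
            continue  # Step 2: drop non-whitelisted tags (their text is kept).
        if kind == "start":
            if tag == "p" and "p" in open_tags:
                # Step 3 quirk: a new <p> implicitly closes the open one.
                while open_tags:
                    closed = open_tags.pop()
                    out.append(f"</{closed}>")
                    if closed == "p":
                        break
            # Step 5 (simplified): keep only whitelisted attribute names.
            kept = {k: v for k, v in attrs.items()
                    if k in ALLOWED[tag] and v is not None}
            attr_text = "".join(f' {k}="{escape(v)}"' for k, v in kept.items())
            out.append(f"<{tag}{attr_text}>")  # Step 6 happens as we go.
            open_tags.append(tag)
        elif tag in open_tags:
            # Step 3: close everything opened after the matching start tag.
            while open_tags:
                closed = open_tags.pop()
                out.append(f"</{closed}>")
                if closed == tag:
                    break
        # An end tag with no matching start tag is dropped.
    while open_tags:  # Step 3: close anything still open at end of input.
        out.append(f"</{open_tags.pop()}>")
    return "".join(out)
```

Feeding it something like purify('<a href="/x" onclick="evil()">link') drops the onclick attribute and supplies the missing closing tag. The real library works on a token array in separate passes; folding the passes into one loop here is only to keep the sketch short.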
Posted: Fri Jun 29, 2007 7:03 pm
by superdezign
Sounds impressive. I just recently wrote a parser with a tokenizer for my blog's tags and I'd love to find a more efficient method of tokenizing (unless it's only possible through recursion... Then I'm doing fine.)
Posted: Fri Jun 29, 2007 7:07 pm
by Ambush Commander
Well, if you can, use PHP's DOM extension to parse possibly poorly formed HTML. It's much faster, since it's implemented natively in C. From the sound of things, however, it looks like you're parsing and making the document well formed at the same time (otherwise, recursion would not be necessary).
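To illustrate the point that tokenizing on its own needs no recursion, here is a minimal single-pass tokenizer in Python. The bracketed [b]...[/b] syntax is an assumption (the thread never says what the blog's tags look like), and TOKEN_RE and tokenize are hypothetical names.

```python
import re

# Hypothetical BBCode-like syntax: [name] opens a tag, [/name] closes it.
TOKEN_RE = re.compile(r"\[(/?)([a-z]+)\]")


def tokenize(source):
    """Split source into ("open" | "close" | "text", value) tokens in one pass."""
    tokens = []
    position = 0
    for match in TOKEN_RE.finditer(source):
        if match.start() > position:
            # Plain text between the previous tag and this one.
            tokens.append(("text", source[position:match.start()]))
        kind = "close" if match.group(1) else "open"
        tokens.append((kind, match.group(2)))
        position = match.end()
    if position < len(source):
        tokens.append(("text", source[position:]))  # trailing text
    return tokens
```

The tokenizer does not care whether tags are nested correctly; fixing up nesting can be a separate pass over the flat token list, typically with a stack of open tags.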
Posted: Fri Jun 29, 2007 7:16 pm
by alex.barylski
superdezign wrote: Sounds impressive. I just recently wrote a parser with a tokenizer for my blog's tags and I'd love to find a more efficient method of tokenizing (unless it's only possible through recursion... Then I'm doing fine.)
Recursion is evil if what you are after is optimized code.
AC, does your HTMLPurifier use the DOM?
Posted: Fri Jun 29, 2007 7:20 pm
by patrikG
Hockey wrote: Recursion is evil if what you are after is optimized code

Why?
Posted: Fri Jun 29, 2007 7:20 pm
by Ambush Commander
Ahh... then I have sinned. HTML Purifier, if it detects PHP5 and DOM, will use DOM to parse HTML. Then I traverse the DOM and translate it back into tokens (using recursion) that get processed later on (design decision I made early on). I use some reference magic to keep things zippy though, and it still beats out the pure-PHP parser every time.
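A minimal sketch of the "traverse the DOM and translate it back into tokens" idea, in Python rather than PHP. It assumes well-formed input because xml.dom.minidom is a strict XML parser, unlike PHP's DOM extension with loadHTML. The function names are hypothetical, and the second function shows the same traversal with an explicit stack for anyone who prefers to avoid recursion.

```python
from xml.dom import minidom


def tokens_recursive(node, out):
    """Append start/text/end tokens for node and its descendants, depth first."""
    if node.nodeType == node.TEXT_NODE:
        out.append(("text", node.data))
        return
    if node.nodeType != node.ELEMENT_NODE:
        return  # comments and other node types are skipped in this sketch
    out.append(("start", node.tagName, dict(node.attributes.items())))
    for child in node.childNodes:
        tokens_recursive(child, out)
    out.append(("end", node.tagName))


def tokens_iterative(root):
    """Same traversal, using an explicit stack instead of the call stack."""
    out = []
    stack = [("visit", root)]
    while stack:
        action, node = stack.pop()
        if action == "end":
            out.append(("end", node.tagName))
        elif node.nodeType == node.TEXT_NODE:
            out.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            out.append(("start", node.tagName, dict(node.attributes.items())))
            stack.append(("end", node))
            # Push children reversed so they are popped in document order.
            stack.extend(("visit", child) for child in reversed(node.childNodes))
    return out


def dom_to_tokens(markup):
    """Parse well-formed markup and flatten it with the recursive walker."""
    root = minidom.parseString(markup).documentElement
    out = []
    tokens_recursive(root, out)
    return out
```

Both walkers produce the same token list. Which one is faster depends on the runtime and the depth of the tree, so it is worth measuring rather than assuming.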
Posted: Wed Jul 04, 2007 3:14 am
by Oren
And now it's on SitePoint...
http://www.sitepoint.com/blogs/2007/07/ ... ne-cometh/
(6th on the list)
Congrats, Ambush Commander!