HTML Purifier on del.icio.us!
Moderator: General Moderators
Yep, it's on the front page! http://del.icio.us/
- Maugrim_The_Reaper
- DevNet Master
- Posts: 2704
- Joined: Tue Nov 02, 2004 5:43 am
- Location: Ireland
- alex.barylski
- DevNet Evangelist
- Posts: 6267
- Joined: Tue Dec 21, 2004 5:00 pm
- Location: Winnipeg
- superdezign
- DevNet Master
- Posts: 4135
- Joined: Sat Jan 20, 2007 11:06 pm
Hockey wrote: I'm curious and too lazy to inspect...
But how exactly does HTML Purifier cleanse HTML markup?
Does it use a fixed set of rules (via regex) to strip/rip and replace bad tags? From watching AC I would suspect it uses a fairly complicated parser to carry out its magic?
Download it and look at it. I downloaded it, but haven't had a chance to dissect any of it yet. My assumption is that it tokenizes the markup and matches tokens to determine proper nesting, checks tags and attributes against the allowed tags and attributes, and probably replaces cluttered elements with more efficient ones (as for the last one, if it doesn't, then it should be on a wish list).
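To make the "matching tokens to determine proper nesting" guess concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not HTML Purifier's code (which is PHP); the names NestingChecker, check_nesting and the VOID set are made up for this example. It keeps a stack of open tags and reports anything that closes out of order or never closes.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (hypothetical short list for the sketch).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class NestingChecker(HTMLParser):
    """Tracks open tags on a stack and records closing tags that do not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # e.g. <br/> produces a start and an end event; ignore both
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")


def check_nesting(html):
    """Return a list of nesting problems; an empty list means properly nested."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    # Whatever is still on the stack was opened but never closed.
    unclosed = [f"unclosed <{tag}>" for tag in reversed(checker.stack)]
    return checker.errors + unclosed
```

For example, check_nesting("<b><i>x</b></i>") reports the misplaced </b> and the <b> that is left open, while properly nested input gives an empty list.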
- RobertGonzalez
- Site Administrator
- Posts: 14293
- Joined: Tue Sep 09, 2003 6:04 pm
- Location: Fremont, CA, USA
- superdezign
- DevNet Master
- Posts: 4135
- Joined: Sat Jan 20, 2007 11:06 pm
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
That's pretty cool. It basically tripled the number of del.icio.us bookmarks htmlpurifier.org has.
(It's tough to track del.icio.us referrals, though, because they don't happen immediately. After all, it is a bookmarking site.)
Google does weird things to my website. Before I started excluding their bot from my website, it was constituting 70% of my site traffic (great ego boost, but not so informative). They send me the most referrals, though, so I'm not complaining. The top generic search term is "php html filter", after that it's "embed youtube html"
Hockey, if you want to know about HTML Purifier's internals in a nutshell, it's basically
1. Parse document into an array of tag and text tokens (Lexer)
2. Remove all elements not on whitelist and transform certain other elements
into acceptable forms (e.g. <font>)
3. Make document well formed while helpfully taking into account certain quirks,
such as the fact that <p> tags traditionally are closed by other block-level
elements.
4. Run through all nodes and check children for proper order (especially
important for tables).
5. Validate attributes according to more restrictive definitions based on the
RFCs.
6. Translate back into a string. (Generator)
...and a lot of little details.
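The steps above can be sketched roughly as follows. This is a toy in Python, not HTML Purifier's actual code or API: the ALLOWED, TRANSFORMS and BLOCK tables and every function name are assumptions made up for the illustration. Step 4 (child order) is left out, and step 5 is reduced to a single URL scheme check on href.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical whitelist: tag -> allowed attributes (not HTML Purifier's config).
ALLOWED = {"p": set(), "b": set(), "i": set(), "span": set(), "a": {"href"}}
TRANSFORMS = {"font": "span"}  # deprecated tag -> acceptable replacement
BLOCK = {"p"}                  # block-level tags that implicitly close an open <p>


class Lexer(HTMLParser):
    """Step 1: turn markup into a flat list of (kind, value, attrs) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag, {}))

    def handle_data(self, data):
        self.tokens.append(("text", data, {}))


def tokenize(html):
    lexer = Lexer()
    lexer.feed(html)
    lexer.close()
    return lexer.tokens


def filter_tokens(tokens):
    """Step 2: rename transformable tags, drop tags and attributes not allowed."""
    out = []
    for kind, value, attrs in tokens:
        if kind != "text":
            value = TRANSFORMS.get(value, value)
            if value not in ALLOWED:
                continue  # the tag goes, its text content stays
            attrs = {k: v for k, v in attrs.items() if k in ALLOWED[value]}
        out.append((kind, value, attrs))
    return out


def make_well_formed(tokens):
    """Step 3: close what is left open and skip stray closing tags."""
    out, stack = [], []
    for kind, value, attrs in tokens:
        if kind == "start":
            if value in BLOCK and "p" in stack:
                # Quirk: an open <p> is closed by the next block-level element.
                while stack:
                    top = stack.pop()
                    out.append(("end", top, {}))
                    if top == "p":
                        break
            stack.append(value)
            out.append((kind, value, attrs))
        elif kind == "end":
            if value not in stack:
                continue  # stray closing tag
            while True:
                top = stack.pop()
                out.append(("end", top, {}))
                if top == value:
                    break
        else:
            out.append((kind, value, attrs))
    while stack:
        out.append(("end", stack.pop(), {}))
    return out


def validate_attributes(tokens):
    """Step 5 (much simplified): keep href only for http and https URLs."""
    out = []
    for kind, value, attrs in tokens:
        if kind == "start" and "href" in attrs:
            if urlparse(attrs["href"] or "").scheme not in {"http", "https"}:
                attrs = {k: v for k, v in attrs.items() if k != "href"}
        out.append((kind, value, attrs))
    return out


def generate(tokens):
    """Step 6: serialize the tokens back into a string."""
    parts = []
    for kind, value, attrs in tokens:
        if kind == "text":
            parts.append(escape(value, quote=False))
        elif kind == "start":
            attr_text = "".join(f' {k}="{escape(v or "")}"' for k, v in attrs.items())
            parts.append(f"<{value}{attr_text}>")
        else:
            parts.append(f"</{value}>")
    return "".join(parts)


def purify(html):
    tokens = make_well_formed(filter_tokens(tokenize(html)))
    return generate(validate_attributes(tokens))
```

With these tables, purify('<p>Hi <b>there<blink>!</blink><p>Next <font color="red">red</font>') gives '<p>Hi <b>there!</b></p><p>Next <span>red</span></p>'. A fuller transform would carry the font's presentation over instead of just dropping the color attribute.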
- superdezign
- DevNet Master
- Posts: 4135
- Joined: Sat Jan 20, 2007 11:06 pm
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
- alex.barylski
- DevNet Evangelist
- Posts: 6267
- Joined: Tue Dec 21, 2004 5:00 pm
- Location: Winnipeg
superdezign wrote: Sounds impressive. I just recently wrote a parser with a tokenizer for my blog's tags and I'd love to find a more efficient method of tokenizing (unless it's only possible through recursion... Then I'm doing fine.)
Recursion is evil if what you are after is optimized code.
AC, does your HTML Purifier use the DOM?
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Ahh... then I have sinned. HTML Purifier, if it detects PHP5 and DOM, will use DOM to parse HTML. Then I traverse the DOM and translate it back into tokens (using recursion) that get processed later on (design decision I made early on). I use some reference magic to keep things zippy though, and it still beats out the pure-PHP parser every time.
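For anyone wondering what "traverse the DOM and translate it back into tokens" can look like, here is a small sketch in Python with xml.dom.minidom standing in for PHP's DOM extension. It is not HTML Purifier's code: the function names and the token shape are assumptions, and minidom needs well-formed input. The recursive version appends to one shared list that is passed down by reference, and the second version produces the same tokens with an explicit stack, for those who consider recursion evil.

```python
from xml.dom import minidom


def tokens_recursive(node, out):
    """Depth-first walk that appends start/text/end tokens to the shared list."""
    if node.nodeType == node.TEXT_NODE:
        out.append(("text", node.data))
    elif node.nodeType == node.ELEMENT_NODE:
        out.append(("start", node.tagName, dict(node.attributes.items())))
        for child in node.childNodes:
            tokens_recursive(child, out)
        out.append(("end", node.tagName))
    return out


def tokens_iterative(root):
    """Same token stream, using an explicit stack instead of recursion."""
    out = []
    # Each entry is (node, closing); closing=True means "emit the end tag now".
    stack = [(root, False)]
    while stack:
        node, closing = stack.pop()
        if closing:
            out.append(("end", node.tagName))
        elif node.nodeType == node.TEXT_NODE:
            out.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            out.append(("start", node.tagName, dict(node.attributes.items())))
            stack.append((node, True))
            # Push children in reverse so they pop in document order.
            for child in reversed(node.childNodes):
                stack.append((child, False))
    return out


doc = minidom.parseString('<div><p class="x">Hi <b>there</b></p></div>')
root = doc.documentElement
```

Both tokens_recursive(root, []) and tokens_iterative(root) return the same flat list, starting with ("start", "div", {}) and ending with ("end", "div").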
And now it's on SitePoint... http://www.sitepoint.com/blogs/2007/07/ ... ne-cometh/
(6th on the list)
Congrats Ambush Commander