HTMLCleaner class - clean WYSIWYG generated code

luci
Forum Newbie
Posts: 2
Joined: Wed Mar 19, 2008 10:35 am

HTMLCleaner class - clean WYSIWYG generated code

Post by luci »

For a while now I've been working in my spare time on a PHP class for cleaning ugly code generated by WYSIWYG editors (especially MS Word).

I've combined the powerful HTML Tidy library with my own regular-expression-based cleaning algorithms. I wanted a simple way to strip all unnecessary tags and styles while keeping the output compliant with W3C standards.

Syntax checking is done only when Tidy is used.
Note that this tool is designed to strip useless tags and attributes, bring the markup back to HTML basics and optimize the code, not to sanitize it (as HTMLPurifier does).

Without the tidy PHP extension, the class can:
- remove styles, attributes
- strip useless tags
- fill empty table cells with non-breaking spaces
- optimize code (merge inline tags, strip empty inline tags, trim excess new lines)
- drop empty paragraphs
- compress (trim space and new-line breaks).
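
To make the list above more concrete, here is a minimal sketch of how a few of these regex-based steps could look. The function names are hypothetical and are not taken from the class; the actual HTMLCleaner code may work quite differently. The merge step only handles tags without attributes, and the compress step also removes whitespace between adjacent tags.

```php
<?php
// Hypothetical helpers illustrating the regex-based steps; NOT the class's actual code.

// Fill empty <td>/<th> cells with a non-breaking space.
function fillEmptyTableCells(string $html): string
{
    return preg_replace('#<(td|th)\b([^>]*)>\s*</\1>#i', '<$1$2>&nbsp;</$1>', $html);
}

// Drop paragraphs that contain only whitespace or &nbsp; entities.
function dropEmptyParagraphs(string $html): string
{
    return preg_replace('#<p\b[^>]*>(?:\s|&nbsp;)*</p>#i', '', $html);
}

// Merge adjacent identical inline tags that carry no attributes:
// <b>foo</b><b>bar</b> becomes <b>foobar</b>.
function mergeInlineTags(string $html, array $tags): string
{
    foreach ($tags as $tag) {
        $tag = preg_quote($tag, '#');
        $html = preg_replace('#</' . $tag . '>(\s*)<' . $tag . '>#i', '$1', $html);
    }
    return $html;
}

// Remove whitespace between tags and collapse whitespace runs inside text.
function compress(string $html): string
{
    $html = preg_replace('#>\s+<#', '><', $html);
    return trim(preg_replace('#\s+#', ' ', $html));
}
```

For example, mergeInlineTags('<b>foo</b><b>bar</b>', array('b')) gives '<b>foobar</b>'.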

In conjunction with Tidy, the class can apply all Tidy actions (clean up, fix errors, convert to XHTML, etc.) and then optionally perform all of its own actions (remove styles, compress, etc.).

Currently the following cleaning method is implemented: tag whitelist/attribute blacklist
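
A rough sketch of what "tag whitelist / attribute blacklist" can mean in practice is shown below. Both function names are made up for illustration and this is not the class's implementation. It assumes the whitelist is a string of tags as accepted by PHP's strip_tags() and the blacklist is a regex alternation of attribute names. Regex-based attribute stripping is fragile (for instance, it can be confused by a ">" or by "id=" inside an attribute value), so treat it as a simplification.

```php
<?php
// Hypothetical sketch of tag-whitelist / attribute-blacklist cleaning; NOT the class's actual code.

// Keep only whitelisted tags. strip_tags() takes the allowed tags as one string, e.g. '<p><b>'.
function keepWhitelistedTags(string $html, string $whitelist): string
{
    // The whitelist may span several lines, so remove whitespace first.
    $allowed = preg_replace('/\s+/', '', $whitelist);
    return strip_tags($html, $allowed);
}

// Remove attributes whose names match the blacklist pattern, e.g. 'id|on[\w]+'.
function stripBlacklistedAttributes(string $html, string $blacklist): string
{
    // name = "value" | 'value' | bare value
    $attr = '#\s+(?:' . $blacklist . ')\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)#i';
    return preg_replace_callback(
        '#<[a-z][^>]*>#i',                 // opening tags only
        function ($m) use ($attr) {
            return preg_replace($attr, '', $m[0]);
        },
        $html
    );
}

$dirty = '<div><p id="x" class="y" onclick="go()">Hi <span>there</span></p></div>';
$clean = stripBlacklistedAttributes(keepWhitelistedTags($dirty, "<p>\n<b>"), 'id|on[\w]+');
echo $clean, "\n";
```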

Properties:

var $html;
var $Options;
var $Tag_whitelist='<table><tbody><thead><tfoot><tr><th><td><colgroup><col>
<p>
<hr><blockquote>
<b><i><u><sub><sup><strong><em><tt><var>
<code><xmp><cite><pre><abbr><acronym><address><samp>
<fieldset><legend>
<a><img>
<h1><h2><h3><h4><h5><h6>
<ul><ol><li><dl><dt>
<frame><frameset>
<form><input><select><option><optgroup><button><textarea>';
var $Attrib_blacklist='id|on[\w]+';
var $CleanUpTags=array('a','span','b','i','u','strong','em','big','small','tt','var','code','xmp','cite','pre','abbr','acronym','address','q','samp','sub','sup');//array of inline tags that can be merged
var $TidyConfig;
var $Encoding='latin1';
 
$this->Options = array(
    'RemoveStyles'        => true,  //removes style definitions like style and class
    'IsWord'              => true,  //Microsoft Word flag - specific operations may occur
    'UseTidy'             => true,  //also uses the tidy engine to clean up the source (recommended)
    'CleaningMethod'      => array(TAG_WHITELIST,ATTRIB_BLACKLIST), //cleaning methods
    'OutputXHTML'         => true,  //converts to XHTML by using Tidy
    'FillEmptyTableCells' => true,  //fills empty cells with non-breaking spaces
    'DropEmptyParas'      => true,  //drops empty paragraphs
    'Optimize'            => false, //optimizes code - merges tags
    'Compress'            => false); //trims all spaces (line breaks, tabs) between tags and between words
 
// Specify TIDY configuration
$this->TidyConfig = array(
    'indent'         => true, /*a bit slow*/
    'output-xhtml'   => true, //outputs the data in XHTML format
    'word-2000'      => false, //removes all proprietary data when an MS Word document has been saved as HTML
    //'clean'        => true, /*too slow*/
    'drop-proprietary-attributes' => true, //removes all attributes that are not part of a web standard
    'hide-comments'  => true, //strips all comments
    'preserve-entities' => true, //preserves well-formed entities as found in the input
    'quote-ampersand' => true, //outputs unadorned & characters as &amp;
    'wrap'           => 200); //sets the number of characters allowed before a line is soft-wrapped
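
For readers who have not used the tidy extension, a configuration array like the one above can be handed to PHP's tidy_repair_string(). The sketch below is only an illustration, not the class's cleanUp() code: tidyClean() is a made-up helper, and the 'show-body-only' option is my own addition so that only the body fragment comes back. If the tidy extension is missing, the helper simply returns the input unchanged.

```php
<?php
// Illustrative helper; NOT part of the HTMLCleaner class.
function tidyClean(string $html, array $config, string $encoding = 'latin1'): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // tidy extension not loaded: fall back to the unchanged input
    }
    $repaired = tidy_repair_string($html, $config, $encoding);
    return $repaired === false ? $html : $repaired;
}

$config = array(
    'output-xhtml'   => true,
    'show-body-only' => true, // assumption: not in the original config
    'hide-comments'  => true,
    'wrap'           => 200,
);

echo tidyClean('<p>Unclosed paragraph<!-- comment --><b>bold', $config), "\n";
```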
 
Methods:

function RemoveBlacklistedAttributes($attribs) //removes specified attributes
function cleanUp($encoding='latin1') //actual cleanup function

Demo (no Tidy support on the server, unfortunately, so only basic cleaning applies):
http://luci.criosweb.ro/scripts/HTMLCleaner/
Attachment: HTMLCleaner.rar (HTMLCleaner source code, 10.14 KiB)
Zoxive
Forum Regular
Posts: 974
Joined: Fri Apr 01, 2005 4:37 pm
Location: Bay City, Michigan

Re: HTMLCleaner class - clean WYSIWYG generated code

Post by Zoxive »

There is a similar project made by a regular here, Ambush Commander:

http://htmlpurifier.org/

It's more of a standards thing, and not a Microsoft doc fix.
luci
Forum Newbie
Posts: 2
Joined: Wed Mar 19, 2008 10:35 am

Re: HTMLCleaner class - clean WYSIWYG generated code

Post by luci »

HTMLPurifier focuses on filtering and error-checking HTML, while my class focuses on simplifying and cleaning HTML.
That's the difference. On the other hand, HTMLPurifier is a mature project, while CRIOSWEB HTMLCleaner is still trying to grow.

Please post opinions and critiques here...