DirectLex: Yet another SAX style HTML tokenizer

Posted: Sat Jul 29, 2006 6:49 pm
by Ambush Commander
Aggressively documented, optimized, and fully cognizant of the fact that it can't compare to PHP 5's DOM extension. I'm interested in parts of the code that are difficult to understand, since there's a test suite (comprehensive or not, I don't know) to make sure things are working properly.

Here is some benchmarking info (run time relative to DirectLex, so lower is faster):

Code: Select all

Case            DirectLex   PEARSax3    DOMLex
File: 1.html    100.00%     140.09%     48.71%
File: 2.html    100.00%     588.03%     51.46%
File: 3.html    100.00%     193.15%     84.16%
File: 4.html    100.00%     173.78%      8.61%
<a href="foo>   100.00%     233.47%    118.24%
<a "=>          100.00%    1069.41%    159.53%
Random input    100.00%      23.99%      2.94%

And here is the code:

Code: Select all

<?php

require_once 'HTMLPurifier/Token.php';

/**
 * Forgivingly lexes HTML (SGML-style) markup into tokens.
 * 
 * The lexer parses a string of SGML-style markup and converts it into
 * corresponding tokens.  It doesn't check for well-formedness, although its
 * internal mechanism may make this automatic (such as the case of
 * HTMLPurifier_Lexer_DOMLex).  There are several implementations to choose
 * from.
 * 
 * The lexer is HTML-oriented: it might work with XML, but it's not
 * recommended, as we adhere to a subset of the specification for optimization
 * reasons.
 * 
 * This class should not be directly instantiated, but you may use create() to
 * retrieve a default copy of the lexer.
 * 
 * @note The unit tests will instantiate this class for testing purposes, as
 *       many of the utility functions require a class to be instantiated.
 *       Be careful when porting this class to PHP 5.
 * 
 * @par
 * 
 * @note
 * We use tokens rather than create a DOM representation because DOM would:
 * 
 * @par
 *  -# Require more processing power to create,
 *  -# Require recursion to iterate,
 *  -# Must be compatible with PHP 5's DOM (otherwise duplication),
 *  -# Has the entire document structure (html and body not needed), and
 *  -# Has unknown readability improvement.
 * 
 * @par
 * What the last item means is that the functions for manipulating tokens are
 * already fairly compact, and when well-commented, more abstraction may not
 * be needed.
 * 
 * @see HTMLPurifier_Token
 */
class HTMLPurifier_Lexer
{
    
    /**
     * Lexes an HTML string into tokens.
     * 
     * @param $string String HTML.
     * @return HTMLPurifier_Token array representation of HTML.
     */
    function tokenizeHTML($string) {
        trigger_error('Call to abstract class', E_USER_ERROR);
    }
    
    /**
     * Retrieves or sets the default Lexer as a Prototype Factory.
     * 
     * Depending on what PHP version you are running, the abstract base
     * Lexer class will determine which concrete Lexer is best for you:
     * HTMLPurifier_Lexer_DirectLex for PHP 4, and HTMLPurifier_Lexer_DOMLex
     * for PHP 5 and beyond.
     * 
     * Passing the optional prototype lexer parameter will override the
     * default with your own implementation.  A copy/reference of the prototype
     * lexer will now be returned when you request a new lexer.
     * 
     * @note
     * Though it is possible to call this factory method from subclasses,
     * such usage is not recommended.
     * 
     * @param $prototype Optional prototype lexer.
     * @return Concrete lexer.
     */
    function create($prototype = null) {
        // we don't really care if it's a reference or a copy
        static $lexer = null;
        if ($prototype) {
            $lexer = $prototype;
        }
        if (empty($lexer)) {
            if (version_compare(PHP_VERSION, '5', '>=')) {
                require_once 'HTMLPurifier/Lexer/DOMLex.php';
                $lexer = new HTMLPurifier_Lexer_DOMLex();
            } else {
                require_once 'HTMLPurifier/Lexer/DirectLex.php';
                $lexer = new HTMLPurifier_Lexer_DirectLex();
            }
        }
        return $lexer;
    }
    
    /**
     * Decimal to parsed string conversion table for special entities.
     * @protected
     */
    var $_special_dec2str =
            array(
                    34 => '"',
                    38 => '&',
                    39 => "'",
                    60 => '<',
                    62 => '>'
            );
    
    /**
     * Stripped entity names to decimal conversion table for special entities.
     * @protected
     */
    var $_special_ent2dec =
            array(
                    'quot' => 34,
                    'amp'  => 38,
                    'lt'   => 60,
                    'gt'   => 62
            );
    
    /**
     * Most common entity to raw value conversion table for special entities.
     * @protected
     */
    var $_special_entity2str =
            array(
                    '&quot;' => '"',
                    '&amp;'  => '&',
                    '&lt;'   => '<',
                    '&gt;'   => '>',
                    '&#39;'  => "'",
                    '&#039;' => "'",
                    '&#x27;' => "'"
            );
    
    /**
     * Callback regex string for parsing entities.
     * @protected
     */                             
    var $_substituteEntitiesRegex =
'/&(?:[#]x([a-fA-F0-9]+)|[#]0*(\d+)|([A-Za-z]+));?/';
//     1. hex             2. dec      3. string
    
    /**
     * Substitutes non-special entities with their parsed equivalents. Since
     * running this whenever you have parsed character is t3h 5uck, we run
     * it before everything else.
     * 
     * @protected
     * @param $string String to have non-special entities parsed.
     * @returns Parsed string.
     */
    function substituteNonSpecialEntities($string) {
        // it will try to detect missing semicolons, but don't rely on it
        return preg_replace_callback(
            $this->_substituteEntitiesRegex,
            array($this, 'nonSpecialEntityCallback'),
            $string
            );
    }
    
    /**
     * Callback function for substituteNonSpecialEntities() that does the work.
     * 
     * @warning Though this is public in order to let the callback happen,
     *          calling it directly is not recommended.
     * @param $matches  PCRE matches array, with 0 the entire match, and
     *                  either index 1, 2 or 3 set with a hex value, dec value,
     *                  or string (respectively).
     * @returns Replacement string.
     * @todo Implement string translations
     */
    function nonSpecialEntityCallback($matches) {
        // replaces all but big five
        $entity = $matches[0];
        $is_num = (@$matches[0][1] === '#');
        if ($is_num) {
            $is_hex = (@$entity[2] === 'x');
            $int = $is_hex ? hexdec($matches[1]) : (int) $matches[2];
            if (isset($this->_special_dec2str[$int]))  return $entity;
            return chr($int);
        } else {
            if (isset($this->_special_ent2dec[$matches[3]])) return $entity;
            if (!$this->_entity_lookup) {
                require_once 'HTMLPurifier/EntityLookup.php';
                $this->_entity_lookup = HTMLPurifier_EntityLookup::instance();
            }
            if (isset($this->_entity_lookup->table[$matches[3]])) {
                return $this->_entity_lookup->table[$matches[3]];
            } else {
                return $entity;
            }
        }
    }
    
    /**
     * Contains a copy of the EntityLookup table.
     * @protected
     */
    var $_entity_lookup;
    
    /**
     * Translates CDATA sections into regular sections (through escaping).
     * 
     * @protected
     * @param $string HTML string to process.
     * @returns HTML with CDATA sections escaped.
     */
    function escapeCDATA($string) {
        return preg_replace_callback(
            '/<!\[CDATA\[(.+?)\]\]>/',
            array('HTMLPurifier_Lexer', 'CDATACallback'),
            $string
        );
    }
    
    /**
     * Callback function for escapeCDATA() that does the work.
     * 
     * @warning Though this is public in order to let the callback happen,
     *          calling it directly is not recommended.
     * @param $matches  PCRE matches array, with index 0 the entire match
     *                  and 1 the inside of the CDATA section.
     * @returns Escaped internals of the CDATA section.
     */
    function CDATACallback($matches) {
        // not exactly sure why the character set is needed, but whatever
        return htmlspecialchars($matches[1], ENT_COMPAT, 'UTF-8');
    }
    
}

?>

Code: Select all

<?php

require_once 'HTMLPurifier/Lexer.php';

/**
 * Our in-house implementation of a parser.
 * 
 * A pure PHP parser, DirectLex has absolutely no dependencies, making
 * it a reasonably good default for PHP4.  Written with efficiency in mind,
 * it can be four times faster than HTMLPurifier_Lexer_PEARSax3, although it
 * pales in comparison to HTMLPurifier_Lexer_DOMLex.  It will support UTF-8
 * completely eventually.
 * 
 * @todo Reread XML spec and document differences.
 * @todo Add support for CDATA sections.
 * @todo Determine correct behavior in outputting comment data. (preserve dashes?)
 * @todo Optimize main function tokenizeHTML().
 * @todo Less than sign (<) being prohibited (even as entity) in attr-values?
 */
class HTMLPurifier_Lexer_DirectLex extends HTMLPurifier_Lexer
{
    
    /**
     * Parses special entities into the proper characters.
     * 
     * This string will translate escaped versions of the special characters
     * into the correct ones.
     * 
     * @warning
     * You should be able to treat the output of this function as
     * completely parsed, but that's only because all other entities should
     * have been handled previously in substituteNonSpecialEntities()
     * 
     * @param $string String character data to be parsed.
     * @returns Parsed character data.
     */
    function parseData($string) {
        
        // subtracts amps that cannot possibly be escaped
        $num_amp = substr_count($string, '&') - substr_count($string, '& ') -
            ($string[strlen($string)-1] === '&' ? 1 : 0);
        
        if (!$num_amp) return $string; // abort if no entities
        $num_esc_amp = substr_count($string, '&amp;');
        $string = strtr($string, $this->_special_entity2str);
        
        // code duplication for sake of optimization, see above
        $num_amp_2 = substr_count($string, '&') - substr_count($string, '& ') -
            ($string[strlen($string)-1] === '&' ? 1 : 0);
        
        if ($num_amp_2 <= $num_esc_amp) return $string;
        
        // hmm... now we have some uncommon entities. Use the callback.
        $string = $this->substituteSpecialEntities($string);
        return $string;
    }
    
    /**
     * Whitespace characters for str(c)spn.
     * @protected
     */
    var $_whitespace = "\x20\x09\x0D\x0A";
    
    /**
     * Substitutes only special entities with their parsed equivalents.
     * 
     * @notice We try to avoid calling this function because otherwise, it
     * would have to be called a lot (for every parsed section).
     * 
     * @protected
     * @param $string String to have special entities parsed.
     * @returns Parsed string.
     */
    function substituteSpecialEntities($string) {
        return preg_replace_callback(
            $this->_substituteEntitiesRegex,
            // pass $this so the callback can read the lookup tables
            array($this, 'specialEntityCallback'),
            $string);
    }
    
    /**
     * Callback function for substituteSpecialEntities() that does the work.
     * 
     * This callback has same syntax as nonSpecialEntityCallback().
     * 
     * @warning Though this is public in order to let the callback happen,
     *          calling it directly is not recommended.
     * @param $matches  PCRE-style matches array, with 0 the entire match, and
     *                  either index 1, 2 or 3 set with a hex value, dec value,
     *                  or string (respectively).
     * @returns Replacement string.
     */
    function specialEntityCallback($matches) {
        $entity = $matches[0];
        $is_num = (@$matches[0][1] === '#');
        if ($is_num) {
            $is_hex = (@$entity[2] === 'x');
            $int = $is_hex ? hexdec($matches[1]) : (int) $matches[2];
            return isset($this->_special_dec2str[$int]) ?
                $this->_special_dec2str[$int] :
                $entity;
        } else {
            // map the entity name to its code point, then to the character
            return isset($this->_special_ent2dec[$matches[3]]) ?
                $this->_special_dec2str[$this->_special_ent2dec[$matches[3]]] :
                $entity;
        }
    }
    
    function tokenizeHTML($string) {
        
        // some quick checking (if empty, return empty)
        $string = @ (string) $string;
        if ($string == '') return array();
        
        $cursor = 0; // our location in the text
        $inside_tag = false; // whether or not we're parsing the inside of a tag
        $array = array(); // result array
        
        // escape CDATA
        $string = $this->escapeCDATA($string);
        
        // expand entities THAT AREN'T THE BIG FIVE
        $string = $this->substituteNonSpecialEntities($string);
        
        // infinite loop protection
        // has to be pretty big, since html docs can be big
        // we allow two hundred thousand tags... more than enough?
        $loops = 0;
        
        while(true) {
            
            // infinite loop protection
            if (++$loops > 200000) return array();
            
            $position_next_lt = strpos($string, '<', $cursor);
            $position_next_gt = strpos($string, '>', $cursor);
            
            // triggers on "<b>asdf</b>" but not "asdf <b></b>"
            if ($position_next_lt === $cursor) {
                $inside_tag = true;
                $cursor++;
            }
            
            if (!$inside_tag && $position_next_lt !== false) {
                // We are not inside tag and there still is another tag to parse
                $array[] = new
                    HTMLPurifier_Token_Text(
                        $this->parseData(
                            substr(
                                $string, $cursor, $position_next_lt - $cursor
                            )
                        )
                    );
                $cursor  = $position_next_lt + 1;
                $inside_tag = true;
                continue;
            } elseif (!$inside_tag) {
                // We are not inside tag but there are no more tags
                // If we're already at the end, break
                if ($cursor === strlen($string)) break;
                // Create Text of rest of string
                $array[] = new
                    HTMLPurifier_Token_Text(
                        $this->parseData(
                            substr(
                                $string, $cursor
                            )
                        )
                    );
                break;
            } elseif ($inside_tag && $position_next_gt !== false) {
                // We are in tag and it is well formed
                // Grab the internals of the tag
                $strlen_segment = $position_next_gt - $cursor;
                $segment = substr($string, $cursor, $strlen_segment);
                
                // Check if it's a comment
                if (
                    substr($segment, 0, 3) == '!--' &&
                    substr($segment, $strlen_segment-2, 2) == '--'
                ) {
                    $array[] = new
                        HTMLPurifier_Token_Comment(
                            substr(
                                $segment, 3, $strlen_segment - 5
                            )
                        );
                    $inside_tag = false;
                    $cursor = $position_next_gt + 1;
                    continue;
                }
                
                // Check if it's an end tag
                $is_end_tag = (strpos($segment,'/') === 0);
                if ($is_end_tag) {
                    $type = substr($segment, 1);
                    $array[] = new HTMLPurifier_Token_End($type);
                    $inside_tag = false;
                    $cursor = $position_next_gt + 1;
                    continue;
                }
                
                // Check if it is explicitly self closing, if so, remove
                // trailing slash. Remember, we could have a tag like <br>, so
                // any later token processing scripts must convert improperly
                // classified EmptyTags from StartTags.
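                // use strrpos so a slash inside an attribute value (e.g.
                // href="/foo") is not mistaken for the trailing slash
                $is_self_closing = (strrpos($segment,'/') === $strlen_segment-1);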
                if ($is_self_closing) {
                    $strlen_segment--;
                    $segment = substr($segment, 0, $strlen_segment);
                }
                
                // Check if there are any attributes
                $position_first_space = strcspn($segment, $this->_whitespace);
                
                if ($position_first_space >= $strlen_segment) {
                    if ($is_self_closing) {
                        $array[] = new HTMLPurifier_Token_Empty($segment);
                    } else {
                        $array[] = new HTMLPurifier_Token_Start($segment);
                    }
                    $inside_tag = false;
                    $cursor = $position_next_gt + 1;
                    continue;
                }
                
                // Grab out all the data
                $type = substr($segment, 0, $position_first_space);
                $attribute_string =
                    trim(
                        substr(
                            $segment, $position_first_space
                        )
                    );
                if ($attribute_string) {
                    $attributes = $this->parseAttributeString(
                                        $attribute_string
                                  );
                } else {
                    $attributes = array();
                }
                
                if ($is_self_closing) {
                    $array[] = new HTMLPurifier_Token_Empty($type, $attributes);
                } else {
                    $array[] = new HTMLPurifier_Token_Start($type, $attributes);
                }
                $cursor = $position_next_gt + 1;
                $inside_tag = false;
                continue;
            } else {
                $array[] = new
                    HTMLPurifier_Token_Text(
                        '<' .
                        $this->parseData(
                            substr($string, $cursor)
                        )
                    );
                break;
            }
            break;
        }
        return $array;
    }
    
    /**
     * Takes the inside of an HTML tag and makes an assoc array of attributes.
     * 
     * @param $string Inside of tag excluding name.
     * @returns Assoc array of attributes.
     */
    function parseAttributeString($string) {
        $string = (string) $string; // quick typecast
        
        if ($string == '') return array(); // no attributes
        
        // let's see if we can abort as quickly as possible
        // one equal sign, no spaces => one attribute
        $num_equal = substr_count($string, '=');
        $has_space = strpos($string, ' ');
        if ($num_equal === 0 && !$has_space) {
            // bool attribute
            return array($string => $string);
        } elseif ($num_equal === 1 && !$has_space) {
            // only one attribute
            list($key, $quoted_value) = explode('=', $string);
            $quoted_value = trim($quoted_value);
            if (!$key) return array();
            if (!$quoted_value) return array($key => '');
            $first_char = @$quoted_value[0];
            $last_char  = @$quoted_value[strlen($quoted_value)-1];
            
            $same_quote = ($first_char == $last_char);
            $open_quote = ($first_char == '"' || $first_char == "'");
            
            if ( $same_quote && $open_quote) {
                // well behaved
                $value = substr($quoted_value, 1, strlen($quoted_value) - 2);
            } else {
                // not well behaved
                if ($open_quote) {
                    $value = substr($quoted_value, 1);
                } else {
                    $value = $quoted_value;
                }
            }
            return array($key => $value);
        }
        
        // setup loop environment
        $array  = array(); // return assoc array of attributes
        $cursor = 0; // current position in string (moves forward)
        $size   = strlen($string); // size of the string (stays the same)
        
        // if we have unquoted attributes, the parser expects a terminating
        // space, so let's guarantee that there's always a terminating space.
        $string .= ' ';
        
        // infinite loop protection
        $loops = 0;
        
        while(true) {
            
            // infinite loop protection
            if (++$loops > 1000) return array();
            
            if ($cursor >= $size) {
                break;
            }
            
            $cursor += ($value = strspn($string, $this->_whitespace, $cursor));
            
            // grab the key
            
            $key_begin = $cursor; //we're currently at the start of the key
            
            // scroll past all characters that are the key (not whitespace or =)
            $cursor += strcspn($string, $this->_whitespace . '=', $cursor);
            
            $key_end = $cursor; // now at the end of the key
            
            $key = substr($string, $key_begin, $key_end - $key_begin);
            
            if (!$key) continue; // empty key
            
            // scroll past all whitespace
            $cursor += strspn($string, $this->_whitespace, $cursor);
            
            if ($cursor >= $size) {
                $array[$key] = $key;
                break;
            }
            
            // if the next character is an equal sign, we've got a regular
            // pair, otherwise, it's a bool attribute
            $first_char = @$string[$cursor];
            
            if ($first_char == '=') {
                // key="value"
                
                $cursor++;
                $cursor += strspn($string, $this->_whitespace, $cursor);
                
                // we might be in front of a quote right now
                
                $char = @$string[$cursor];
                
                if ($char == '"' || $char == "'") {
                    // it's quoted, end bound is $char
                    $cursor++;
                    $value_begin = $cursor;
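                    $cursor = strpos($string, $char, $cursor);
                    // unterminated quote: treat the rest of the string as the value
                    if ($cursor === false) $cursor = $size;
                    $value_end = $cursor;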
                } else {
                    // it's not quoted, end bound is whitespace
                    $value_begin = $cursor;
                    $cursor += strcspn($string, $this->_whitespace, $cursor);
                    $value_end = $cursor;
                }
                
                $value = substr($string, $value_begin, $value_end - $value_begin);
                $array[$key] = $this->parseData($value);
                $cursor++;
                
            } else {
                // boolattr
                if ($key !== '') {
                    $array[$key] = $key;
                }
                
            }
        }
        return $array;
    }
    
}

?>

Posted: Sun Jul 30, 2006 7:28 am
by Nathaniel
Dude, that's pretty impressive.

TokenizeHTML and parseAttributeString smell a bit to me. They are doing just one thing, but the methods go on for about a page. Perhaps they would like being extracted into their own classes, or classes with helper classes?
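
As a rough illustration of the kind of extraction meant here, the quote-stripping branch of parseAttributeString could live in a small helper. This is only a sketch: `unquote_attribute_value()` is a made-up name that does not exist in the posted code, and it is written for current PHP (type declarations) rather than PHP 4.

```php
<?php
// Hypothetical helper, not part of DirectLex: strips the quotes from a
// single raw attribute value, tolerating a missing closing quote.
function unquote_attribute_value(string $quoted): string
{
    if ($quoted === '') {
        return '';
    }
    $first = $quoted[0];
    $last  = $quoted[strlen($quoted) - 1];
    $open_quote = ($first === '"' || $first === "'");

    if ($open_quote && $first === $last && strlen($quoted) >= 2) {
        // well behaved: "value" or 'value'
        return (string) substr($quoted, 1, -1);
    }
    // not well behaved: drop a dangling opening quote, if there is one
    return $open_quote ? (string) substr($quoted, 1) : $quoted;
}

var_dump(unquote_attribute_value('"foo"'));
var_dump(unquote_attribute_value('"foo'));
var_dump(unquote_attribute_value('foo'));
```

The parsing loop would then call the helper instead of repeating the first-character and last-character checks inline, at the cost of one extra function call per attribute.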

I'd like to see your test suite, out of curiosity :) Again, it looks like you put a lot of work into that.

- Nathaniel

Posted: Sun Jul 30, 2006 9:39 am
by Ambush Commander
Nathaniel wrote: Dude, that's pretty impressive.
Thanks. I was, however, a little bummed when I found out that PHP 5's DOM extension did the same thing outrageously more efficiently.
Nathaniel wrote: TokenizeHTML and parseAttributeString smell a bit to me. They are doing just one thing, but the methods go on for about a page. Perhaps they would like being extracted into their own classes, or classes with helper classes?
Legitimate concern. I think the reason they're so long is performance: because the loop iterates many times, userspace function calls should be avoided at all costs.
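
A quick way to check how much a userspace call costs inside a hot loop is a micro-benchmark along these lines. It is only a sketch: `is_tag_open()`, `count_inline()` and `count_with_helper()` are made-up names, not part of the lexer, and the timings depend heavily on the PHP version and the machine.

```php
<?php
// Hypothetical helper used only for this comparison.
function is_tag_open(string $s, int $i): bool
{
    return $s[$i] === '<';
}

// Counts '<' characters with the check written inline in the loop body.
function count_inline(string $s): int
{
    $n = 0;
    for ($i = 0, $len = strlen($s); $i < $len; $i++) {
        if ($s[$i] === '<') {
            $n++;
        }
    }
    return $n;
}

// Same loop, but every check goes through a userspace function call.
function count_with_helper(string $s): int
{
    $n = 0;
    for ($i = 0, $len = strlen($s); $i < $len; $i++) {
        if (is_tag_open($s, $i)) {
            $n++;
        }
    }
    return $n;
}

$input = str_repeat('<b>bold</b> text ', 20000);

$t0 = microtime(true);
$a  = count_inline($input);
$t1 = microtime(true);
$b  = count_with_helper($input);
$t2 = microtime(true);

printf("inline: %d matches in %.4fs\n", $a, $t1 - $t0);
printf("helper: %d matches in %.4fs\n", $b, $t2 - $t1);
```

Both functions return the same count, so any difference in the two timings comes from the call overhead alone.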

Then again, it would be a good idea in terms of profiling, because it would mean that it would be easier to pinpoint inefficiencies in some of the factored out functions. Right now, the profile looks something like this:

Code: Select all

Function                  Avg. Self  Avg. Cum.  Total Self  Total Cum.   Calls
tokenizeHTML               1,278ms    2,112ms    12,781ms    21,124ms       10
tokenizeAttributeString      1.2ms      1.1ms     6,463ms     5,727ms    5,180
HTMLPurifier_Token_Tag           -          -     1,191ms     1,231ms   15,150
HTMLPurifier_Token_Text          -          -       736ms       764ms   11,030
php:html_entity_decode           -          -       222ms       222ms   11,030
php:strpos                       -          -       205ms       209ms   87,520
php:substr                       -          -       206ms       206ms   76,010
php:strcspn                      -          -        76ms        76ms   15,290

To be quite honest, it's kind of disgusting to see how many times strpos is called (even allowing for the fact that this was looped 10 times, that's still a whopping 8,752 calls per run).
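
One possible way to cut the strpos count is to remember the last position found for each character and search again only once the cursor has moved past it. The toy scanner below is not how tokenizeHTML works; `scan_naive()` and `scan_cached()` are hypothetical names, and they only count the strpos calls needed to walk a string from one angle bracket to the next.

```php
<?php
// Looks up both '<' and '>' on every iteration, like the main loop does.
function scan_naive(string $s): int
{
    $calls = 0;
    $cursor = 0;
    $n = strlen($s);
    while ($cursor < $n) {
        $lt = strpos($s, '<', $cursor); $calls++;
        $gt = strpos($s, '>', $cursor); $calls++;
        $next = min($lt === false ? $n : $lt, $gt === false ? $n : $gt);
        $cursor = $next + 1;
    }
    return $calls;
}

// Searches again only when the cached position is behind the cursor.
// Once strpos() has returned false, that character is never searched again.
function scan_cached(string $s): int
{
    $calls = 0;
    $cursor = 0;
    $n = strlen($s);
    $lt = -1; // -1 means "not looked up yet"
    $gt = -1;
    while ($cursor < $n) {
        if ($lt !== false && $lt < $cursor) {
            $lt = strpos($s, '<', $cursor); $calls++;
        }
        if ($gt !== false && $gt < $cursor) {
            $gt = strpos($s, '>', $cursor); $calls++;
        }
        $next = min($lt === false ? $n : $lt, $gt === false ? $n : $gt);
        $cursor = $next + 1;
    }
    return $calls;
}

$sample = 'a<b>c</b>';
printf("naive: %d strpos calls\n", scan_naive($sample));
printf("cached: %d strpos calls\n", scan_cached($sample));
```

Whether the extra comparisons pay for themselves in the real lexer would have to be confirmed with the profiler.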
Nathaniel wrote: I'd like to see your test suite, out of curiosity :) Again, it looks like you put a lot of work into that.
Maybe too much work. Attribute parsing is going to be a biggie too.

Anyway... the test suite:

Code: Select all

<?php

require_once 'HTMLPurifier/Lexer/DirectLex.php';
require_once 'HTMLPurifier/Lexer/PEARSax3.php';

class HTMLPurifier_LexerTest extends UnitTestCase
{
    
    var $Lexer;
    var $DirectLex, $PEARSax3, $DOMLex;
    var $_entity_lookup;
    var $_has_dom;
    
    function setUp() {
        $this->Lexer     = new HTMLPurifier_Lexer();
        
        $this->DirectLex = new HTMLPurifier_Lexer_DirectLex();
        $this->PEARSax3  = new HTMLPurifier_Lexer_PEARSax3();
        
        $this->_has_dom = version_compare(PHP_VERSION, '5', '>=');
        if ($this->_has_dom) {
            require_once 'HTMLPurifier/Lexer/DOMLex.php';
            $this->DOMLex    = new HTMLPurifier_Lexer_DOMLex();
        }
        
        $this->_entity_lookup = HTMLPurifier_EntityLookup::instance();
        
    }
    
    function test_substituteNonSpecialEntities() {
        $char_theta = $this->_entity_lookup->table['theta'];
        $this->assertIdentical($char_theta,
            $this->Lexer->substituteNonSpecialEntities('&theta;') );
        $this->assertIdentical('&quot;',
            $this->Lexer->substituteNonSpecialEntities('&quot;') );
    }
    
    function test_tokenizeHTML() {
        
        $input = array();
        $expect = array();
        $sax_expect = array();
        $dom_expect = array();
        
        $input[0] = '';
        $expect[0] = array();
        
        $input[1] = 'This is regular text.';
        $expect[1] = array(
            new HTMLPurifier_Token_Text('This is regular text.')
            );
        
        $input[2] = 'This is <b>bold</b> text';
        $expect[2] = array(
            new HTMLPurifier_Token_Text('This is ')
           ,new HTMLPurifier_Token_Start('b', array())
           ,new HTMLPurifier_Token_Text('bold')
           ,new HTMLPurifier_Token_End('b')
           ,new HTMLPurifier_Token_Text(' text')
            );
        
        $input[3] = '<DIV>Totally rad dude. <b>asdf</b></div>';
        $expect[3] = array(
            new HTMLPurifier_Token_Start('DIV', array())
           ,new HTMLPurifier_Token_Text('Totally rad dude. ')
           ,new HTMLPurifier_Token_Start('b', array())
           ,new HTMLPurifier_Token_Text('asdf')
           ,new HTMLPurifier_Token_End('b')
           ,new HTMLPurifier_Token_End('div')
            );
        
        // [XML-INVALID]
        $input[4] = '<asdf></asdf><d></d><poOloka><poolasdf><ds></asdf></ASDF>';
        $expect[4] = array(
            new HTMLPurifier_Token_Start('asdf')
           ,new HTMLPurifier_Token_End('asdf')
           ,new HTMLPurifier_Token_Start('d')
           ,new HTMLPurifier_Token_End('d')
           ,new HTMLPurifier_Token_Start('poOloka')
           ,new HTMLPurifier_Token_Start('poolasdf')
           ,new HTMLPurifier_Token_Start('ds')
           ,new HTMLPurifier_Token_End('asdf')
           ,new HTMLPurifier_Token_End('ASDF')
            );
        // DOM is different because it condenses empty tags into REAL empty ones
        // as well as makes it well-formed
        $dom_expect[4] = array(
            new HTMLPurifier_Token_Empty('asdf')
           ,new HTMLPurifier_Token_Empty('d')
           ,new HTMLPurifier_Token_Start('pooloka')
           ,new HTMLPurifier_Token_Start('poolasdf')
           ,new HTMLPurifier_Token_Empty('ds')
           ,new HTMLPurifier_Token_End('poolasdf')
           ,new HTMLPurifier_Token_End('pooloka')
            );
        
        $input[5] = '<a'."\t".'href="foobar.php"'."\n".'title="foo!">Link to <b id="asdf">foobar</b></a>';
        $expect[5] = array(
            new HTMLPurifier_Token_Start('a',array('href'=>'foobar.php','title'=>'foo!'))
           ,new HTMLPurifier_Token_Text('Link to ')
           ,new HTMLPurifier_Token_Start('b',array('id'=>'asdf'))
           ,new HTMLPurifier_Token_Text('foobar')
           ,new HTMLPurifier_Token_End('b')
           ,new HTMLPurifier_Token_End('a')
            );
        
        $input[6] = '<br />';
        $expect[6] = array(
            new HTMLPurifier_Token_Empty('br')
            );
        
        // [SGML-INVALID] [RECOVERABLE]
        $input[7] = '<!-- Comment --> <!-- not so well formed --->';
        $expect[7] = array(
            new HTMLPurifier_Token_Comment(' Comment ')
           ,new HTMLPurifier_Token_Text(' ')
           ,new HTMLPurifier_Token_Comment(' not so well formed -')
            );
        $sax_expect[7] = false; // we need to figure out proper comment output
        
        // [SGML-INVALID]
        $input[8] = '<a href=""';
        $expect[8] = array(
            new HTMLPurifier_Token_Text('<a href=""')
            );
        // SAX parses it into a tag
        $sax_expect[8] = array(
            new HTMLPurifier_Token_Start('a', array('href'=>''))
            ); 
        // DOM parses it into an empty tag
        $dom_expect[8] = array(
            new HTMLPurifier_Token_Empty('a', array('href'=>''))
            ); 
        
        $input[9] = '&lt;b&gt;';
        $expect[9] = array(
            new HTMLPurifier_Token_Text('<b>')
            );
        $sax_expect[9] = array(
            new HTMLPurifier_Token_Text('<')
           ,new HTMLPurifier_Token_Text('b')
           ,new HTMLPurifier_Token_Text('>')
            );
        // note that SAX can clump text nodes together. We won't be
        // too picky though
        
        // [SGML-INVALID]
        $input[10] = '<a "=>';
        // We barf on this, aim for no attributes
        $expect[10] = array(
            new HTMLPurifier_Token_Start('a', array('"' => ''))
            );
        // DOM correctly has no attributes, but also closes the tag
        $dom_expect[10] = array(
            new HTMLPurifier_Token_Empty('a')
            );
        // SAX barfs on this
        $sax_expect[10] = array(
            new HTMLPurifier_Token_Start('a', array('"' => ''))
            );
        
        // [INVALID] [RECOVERABLE]
        $input[11] = '"';
        $expect[11] = array( new HTMLPurifier_Token_Text('"') );
        
        // compare with this valid one:
        $input[12] = '&quot;';
        $expect[12] = array( new HTMLPurifier_Token_Text('"') );
        $sax_expect[12] = false; // choked!
        
        // CDATA sections!
        $input[13] = '<![CDATA[You <b>can&#39;t</b> get me!]]>';
        $expect[13] = array( new HTMLPurifier_Token_Text(
            'You <b>can&#39;t</b> get me!' // raw
            ) );
        $sax_expect[13] = array( // SAX has a separate call for each entity
                new HTMLPurifier_Token_Text('You '),
                new HTMLPurifier_Token_Text('<'),
                new HTMLPurifier_Token_Text('b'),
                new HTMLPurifier_Token_Text('>'),
                new HTMLPurifier_Token_Text('can'),
                new HTMLPurifier_Token_Text('&'),
                new HTMLPurifier_Token_Text('#39;t'),
                new HTMLPurifier_Token_Text('<'),
                new HTMLPurifier_Token_Text('/b'),
                new HTMLPurifier_Token_Text('>'),
                new HTMLPurifier_Token_Text(' get me!')
            );
        
        $char_theta = $this->_entity_lookup->table['theta'];
        $char_rarr  = $this->_entity_lookup->table['rarr'];
        
        // test entity replacement
        $input[14] = '&theta;';
        $expect[14] = array( new HTMLPurifier_Token_Text($char_theta) );
        
        // test that entities aren't replaced in CDATA sections
        $input[15] = '&theta; <![CDATA[&rarr;]]>';
        $expect[15] = array( new HTMLPurifier_Token_Text($char_theta . ' &rarr;') );
        $sax_expect[15] = array(
                new HTMLPurifier_Token_Text($char_theta . ' '),
                new HTMLPurifier_Token_Text('&'),
                new HTMLPurifier_Token_Text('rarr;')
            );
        
        foreach($input as $i => $discard) {
            $result = $this->DirectLex->tokenizeHTML($input[$i]);
            $this->assertEqual($expect[$i], $result, 'DirectLexTest '.$i.': %s');
            paintIf($result, $expect[$i] != $result);
            
            // assert unless I say otherwise
            $sax_result = $this->PEARSax3->tokenizeHTML($input[$i]);
            if (!isset($sax_expect[$i])) {
                // by default, assert with normal result
                $this->assertEqual($expect[$i], $sax_result, 'PEARSax3Test '.$i.': %s');
                paintIf($sax_result, $expect[$i] != $sax_result);
            } elseif ($sax_expect[$i] === false) {
                // assertions were turned off, optionally dump
                // paintIf($sax_expect, $i == NUMBER);
            } else {
                // match with a custom SAX result array
                $this->assertEqual($sax_expect[$i], $sax_result, 'PEARSax3Test (custom) '.$i.': %s');
                paintIf($sax_result, $sax_expect[$i] != $sax_result);
            }
            if ($this->_has_dom) {
                $dom_result = $this->DOMLex->tokenizeHTML($input[$i]);
                // same structure as SAX
                if (!isset($dom_expect[$i])) {
                    $this->assertEqual($expect[$i], $dom_result, 'DOMLexTest '.$i.': %s');
                    paintIf($dom_result, $expect[$i] != $dom_result);
                } elseif ($dom_expect[$i] === false) {
                    // paintIf($dom_result, $i == NUMBER);
                } else {
                    $this->assertEqual($dom_expect[$i], $dom_result, 'DOMLexTest (custom) '.$i.': %s');
                    paintIf($dom_result, $dom_expect[$i] != $dom_result);
                }
            }
            
        }
        
    }
    
}

?>
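
The loop at the end of the test chooses what to assert against in three ways: the default expectation, a lexer-specific override, or no assertion at all when the override is false. Here is a minimal sketch of that rule on its own, assuming a hypothetical helper named pick_expectation() that is not part of the code above, with plain strings standing in for token objects:

```php
<?php
// Hypothetical helper for illustration only; the posted test inlines this logic.
// Returns the expectation to assert against for test case $i, or null when
// assertions are switched off for that case.
function pick_expectation($default, $override, $i)
{
    if (!isset($override[$i])) {
        // no lexer-specific entry: compare against the default result
        return $default[$i];
    }
    if ($override[$i] === false) {
        // explicitly disabled for this lexer
        return null;
    }
    // lexer-specific expectation
    return $override[$i];
}

// Stand-in data: strings instead of HTMLPurifier_Token_* objects.
$expect     = array(7 => array('comment'), 8 => array('text'), 9 => array('text'));
$sax_expect = array(7 => false, 8 => array('start'));

$skipped  = pick_expectation($expect, $sax_expect, 7); // null: assertion disabled
$custom   = pick_expectation($expect, $sax_expect, 8); // array('start')
$fallback = pick_expectation($expect, $sax_expect, 9); // array('text')
```

Since isset() is true for an entry holding false, the second check has to use a strict === false comparison to tell "disabled" apart from "not set".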

Posted: Sun Jul 30, 2006 11:15 am
by neophyte
Wow Ambush Commander! That's awesome. I'll try taking some time to look at it more later tonight. Sweet code. I've never done anything that complex before with strings. I've always simply relied on guys like you to write the library.

Nice work!

Posted: Sun Jul 30, 2006 11:19 am
by Ambush Commander
Thanks!

Well, there still are a few missing parts, so the code won't just "run". You need EntityLookup.php, Token.php, and generate-entity-file.php (which are kind of big, so I won't post them).