(X)HTML Parser Needed

Posted: Thu Oct 05, 2006 12:55 am
by rmccue
I'm currently using SafeHTML to parse HTML from outside sources. However, it has a really bad converter for ampersands (it replaces &amp; with & and then replaces & with &amp;, which screws up entities).
Does anyone know where I can get an HTML parser that generates valid XHTML, converts & to &amp;, converts named entities to numbered entities and is GPLed?
Thanks for your help,
Ryan McCue
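
The double-escaping problem described in this post can be avoided by escaping only those ampersands that do not already start an entity. A minimal sketch, assuming a hypothetical helper named escapeBareAmpersands (it is not part of SafeHTML or of any library mentioned in this thread):

```php
<?php
// Hypothetical helper: escape only "bare" ampersands, leaving existing
// named (&amp;), decimal (&#224;) and hex (&#xE0;) entities untouched.
function escapeBareAmpersands(string $html): string
{
    return preg_replace(
        '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/',
        '&amp;',
        $html
    );
}

// Prints: Tom &amp; Jerry &amp; friends &#224; la carte
echo escapeBareAmpersands('Tom & Jerry &amp; friends &#224; la carte'), "\n";
```

The negative lookahead only checks that the text looks like an entity; it does not verify that a named entity actually exists.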

Posted: Thu Oct 05, 2006 9:56 am
by Luke
Take a look here:
http://hp.jpsband.org/

Posted: Thu Oct 05, 2006 6:18 pm
by rmccue
Cool, thanks!
I was trying out kses, but this looks so much better.
However, it doesn't convert, say, &agrave; to &#224;; instead it converts it back to an à. It also deletes numbered entities like & #0224;
Any ideas?

Posted: Thu Oct 05, 2006 6:22 pm
by Luke
well the creator of the library is a regular poster here... he'll surely be around soon to help you... I'll drop him a PM for ya.

Posted: Thu Oct 05, 2006 6:31 pm
by rmccue
Cool, thanks!
BTW if anyone wanted to know, I'm using it in Lilina

Posted: Thu Oct 05, 2006 6:31 pm
by Ambush Commander
Yeah, that's the correct behavior. When you output stuff in UTF-8, you want to use as few entities as possible, so HTML Purifier does that conversion... to the actual character, which is perfectly fine and actually recommended.

It seems like you just want the named entity to be converted into a numeric entity. I could implement something like that, although I won't be able to preserve all the original entities: this is because entities can mask XSS, so they must be parsed.

What numbered entity were you referring to?
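
To illustrate the point in this post about entities masking XSS: a filter that inspects the raw markup can miss a dangerous value that only appears once entities are decoded. The sketch below uses two hypothetical function names and a deliberately naive substring check; it is not how HTML Purifier works internally, just a demonstration of why entities have to be parsed before anything is validated.

```php
<?php
// Naive check against the raw, still-encoded attribute value.
function looksDangerousRaw(string $value): bool
{
    return stripos($value, 'javascript:') !== false;
}

// The same check after entities have been decoded to characters.
function looksDangerousDecoded(string $value): bool
{
    $decoded = html_entity_decode($value, ENT_QUOTES, 'UTF-8');
    return stripos($decoded, 'javascript:') !== false;
}

$href = '&#106;avascript:alert(1)'; // &#106; is the letter "j"

var_dump(looksDangerousRaw($href));     // bool(false)
var_dump(looksDangerousDecoded($href)); // bool(true)
```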

Posted: Thu Oct 05, 2006 6:43 pm
by rmccue
I was talking about & #0224;
Also, having named entities converted to numbered would be good, because most named entities must be numbered in XML.

Posted: Thu Oct 05, 2006 6:49 pm
by Ambush Commander
Hmm, & #0224; seems to work for me. Lower-case a with an accent grave, correct?

Correct. XML only defines five named entities. If you look carefully, however, you'll notice that HTML Purifier's behavior is thus:

Named entity -> Character
Numeric entity -> Character
Character stays a Character

HTML Purifier will never output an entity unless it's &lt;, &gt;, &amp; or &quot;.

I'm not sure what you mean by "having named entities": you're not allowed to have them in XML except the four + apostrophe I mentioned above.
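
The behavior described in this post (named and numeric entities become characters, and only the markup-special characters are written back as entities) can be approximated with PHP's built-in functions. This is a rough sketch under that assumption, with a hypothetical function name; it is not HTML Purifier's actual code.

```php
<?php
// Decode every entity to its UTF-8 character, then re-escape only the
// characters that are special in markup: <, >, & and ".
function normalizeEntities(string $text): string
{
    $chars = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    return htmlspecialchars($chars, ENT_COMPAT, 'UTF-8');
}

// Prints: à à à &lt;b&gt; &amp; &quot;
echo normalizeEntities('&agrave; &#224; à &lt;b&gt; &amp; &quot;'), "\n";
```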

Posted: Thu Oct 05, 2006 6:53 pm
by rmccue
What I mean is, it converts entities back to characters, whereas I would prefer them to remain as numbered entities, due to the fact that some of my users may use a different encoding.
Is this possible?
BTW I was accidentally using & #024; :D

Posted: Thu Oct 05, 2006 7:03 pm
by Ambush Commander
Ha, yes, that could be a problem. Note that HTML Purifier is smart enough to strip out control characters from input. A lot of people often forget to do that.
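
As a rough illustration of stripping control characters (code point 24, the one used by accident above, is one of them), here is a minimal sketch with a hypothetical function name. It keeps tab, line feed and carriage return, which matches the check in the function posted in this thread, and it is not taken from HTML Purifier itself.

```php
<?php
// Remove C0 control characters and DEL, but keep \t (9), \n (10), \r (13).
// The pattern works on bytes; UTF-8 multi-byte sequences only use bytes
// >= 0x80, so they are never touched.
function stripControlCharacters(string $text): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $text);
}

// "\x18" is code point 24 (CAN).
var_dump(stripControlCharacters("a\x18b\tc\n") === "ab\tc\n"); // bool(true)
```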

Yes, but not as the code stands. While there really is no reason why they should be using anything besides UTF-8 (iconv, anyone?), here's what you can do.

Do the output as usual. Then, run this function on it:

Code:

// adapted from utf8ToUnicode by Henri Sivonen and
// hsivonen@iki.fi at <http://iki.fi/hsivonen/php-utf8/>
function escapeNonASCIICharacters($str) {
    
    $mState = 0; // cached expected number of octets after the current octet
                 // until the beginning of the next UTF8 character sequence
    $mUcs4  = 0; // cached Unicode character
    $mBytes = 1; // cached expected number of octets in the current sequence
    
    // original code involved an $out that was an array of Unicode
    // codepoints.  Instead of having to convert back into UTF-8, we've
    // decided to directly append valid UTF-8 characters onto a string
    // $out once they're done.  $char accumulates raw bytes, while $mUcs4
    // turns into the Unicode code point, so there's some redundancy.
    
    $out = '';
    $char = '';
    
    $len = strlen($str);
    for($i = 0; $i < $len; $i++) {
        $in = ord($str[$i]);
        $char .= $str[$i]; // append byte to char
        if (0 == $mState) {
            // When mState is zero we expect either a US-ASCII character 
            // or a multi-octet sequence.
            if (0 == (0x80 & ($in))) {
                // US-ASCII, pass straight through.
                if (($in <= 31 || $in == 127) && 
                    !($in == 9 || $in == 13 || $in == 10) // save \r\t\n
                ) {
                    // control characters, remove
                } else {
                    $out .= $char;
                }
                // reset
                $char = '';
                $mBytes = 1;
            } elseif (0xC0 == (0xE0 & ($in))) {
                // First octet of 2 octet sequence
                $mUcs4 = ($in);
                $mUcs4 = ($mUcs4 & 0x1F) << 6;
                $mState = 1;
                $mBytes = 2;
            } elseif (0xE0 == (0xF0 & ($in))) {
                // First octet of 3 octet sequence
                $mUcs4 = ($in);
                $mUcs4 = ($mUcs4 & 0x0F) << 12;
                $mState = 2;
                $mBytes = 3;
            } elseif (0xF0 == (0xF8 & ($in))) {
                // First octet of 4 octet sequence
                $mUcs4 = ($in);
                $mUcs4 = ($mUcs4 & 0x07) << 18;
                $mState = 3;
                $mBytes = 4;
            } elseif (0xF8 == (0xFC & ($in))) {
                // First octet of 5 octet sequence.
                // 
                // This is illegal because the encoded codepoint must be 
                // either:
                // (a) not the shortest form or
                // (b) outside the Unicode range of 0-0x10FFFF.
                // Rather than trying to resynchronize, we will carry on 
                // until the end of the sequence and let the later error
                // handling code catch it.
                $mUcs4 = ($in);
                $mUcs4 = ($mUcs4 & 0x03) << 24;
                $mState = 4;
                $mBytes = 5;
            } elseif (0xFC == (0xFE & ($in))) {
                // First octet of 6 octet sequence, see comments for 5
                // octet sequence.
                $mUcs4 = ($in);
                $mUcs4 = ($mUcs4 & 1) << 30;
                $mState = 5;
                $mBytes = 6;
            } else {
                // Current octet is neither in the US-ASCII range nor a 
                // legal first octet of a multi-octet sequence.
                $mState = 0;
                $mUcs4  = 0;
                $mBytes = 1;
                $char = '';
            }
        } else {
            // When mState is non-zero, we expect a continuation of the
            // multi-octet sequence
            if (0x80 == (0xC0 & ($in))) {
                // Legal continuation.
                $shift = ($mState - 1) * 6;
                $tmp = $in;
                $tmp = ($tmp & 0x0000003F) << $shift;
                $mUcs4 |= $tmp;
                
                if (0 == --$mState) {
                    // End of the multi-octet sequence. mUcs4 now contains
                    // the final Unicode codepoint to be output
                    
                    // Check for illegal sequences and codepoints.
                    
                    // From Unicode 3.1, non-shortest form is illegal
                    if (((2 == $mBytes) && ($mUcs4 < 0x0080)) ||
                        ((3 == $mBytes) && ($mUcs4 < 0x0800)) ||
                        ((4 == $mBytes) && ($mUcs4 < 0x10000)) ||
                        (4 < $mBytes) ||
                        // From Unicode 3.2, surrogate characters = illegal
                        (($mUcs4 & 0xFFFFF800) == 0xD800) ||
                        // Codepoints outside the Unicode range are illegal
                        ($mUcs4 > 0x10FFFF)
                    ) {
                        // illegal sequence or codepoint: drop it
                    } elseif (0xFEFF != $mUcs4 && // omit BOM
                        !($mUcs4 >= 128 && $mUcs4 <= 159) // omit non-SGML
                    ) {
                        $out .= '&#' . $mUcs4 . ';';
                    }
                    // initialize UTF8 cache (reset)
                    $mState = 0;
                    $mUcs4  = 0;
                    $mBytes = 1;
                    $char = '';
                }
            } else {
                // ((0xC0 & (*in) != 0x80) && (mState != 0))
                // Incomplete multi-octet sequence.
                // used to result in complete fail, but we'll reset
                $mState = 0;
                $mUcs4  = 0;
                $mBytes = 1;
                $char ='';
            }
        }
    }
    return $out;
}
I'll bundle this functionality with the main library in a later release. This is dumb escaping: EVERYTHING non-ASCII gets escaped.
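
For comparison, the same kind of dumb escaping can be sketched more compactly with PCRE's UTF-8 mode. This is an alternative under stated assumptions, not the function above: escapeNonAsciiSketch is a hypothetical name, it returns an empty string when the input is not valid UTF-8 instead of skipping the bad bytes, and it does not drop control characters, the BOM or the 128-159 range.

```php
<?php
// Replace every non-ASCII character in a valid UTF-8 string with a
// decimal numeric entity.
function escapeNonAsciiSketch(string $utf8): string
{
    // With /u, PCRE matches whole code points and yields null when the
    // subject is not valid UTF-8.
    $result = preg_replace_callback('/[^\x00-\x7F]/u', function (array $m): string {
        $bytes = array_values(unpack('C*', $m[0]));
        $n = count($bytes);
        // Payload of the lead byte: 5 bits (2-byte), 4 bits (3-byte)
        // or 3 bits (4-byte sequence).
        $cp = $bytes[0] & (0xFF >> ($n + 1));
        // Each continuation byte contributes its low 6 bits.
        for ($i = 1; $i < $n; $i++) {
            $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
        }
        return '&#' . $cp . ';';
    }, $utf8);

    return $result ?? '';
}

// Prints: caf&#233; costs &#8364;5
echo escapeNonAsciiSketch("caf\u{E9} costs \u{20AC}5"), "\n";
```

Because the regex engine does the UTF-8 validation, the callback can assume each match is one well-formed sequence of two to four bytes.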

Posted: Thu Oct 05, 2006 7:11 pm
by rmccue
Thanks! *throws in lib.php*
Would I still need to use this if it is UTF-8?

Posted: Thu Oct 05, 2006 7:13 pm
by Ambush Commander
Definitely not. The function isn't fast: after all, it has to analyze the string byte by byte, so if you can get away with not using it, by all means do so.

Posted: Sun Oct 08, 2006 4:17 am
by rmccue
Should I make a new topic for that or leave it on this one?

Posted: Sun Oct 08, 2006 5:01 am
by Weirdan
rmccue wrote: "Should I make a new topic for that or leave it on this one?"
Here it is: viewtopic.php?p=316735
You may bump it since it's about 2 days old.