(X)HTML Parser Needed
I'm currently using SafeHTML to parse HTML from outside sources. However, it has a really bad converter for ampersands (it replaces &amp;amp; by & then replaces & by &amp;amp;, which screws up entities).
Does anyone know where I can get an HTML parser that generates valid XHTML, converts & to &amp;amp;, converts named entities to numbered entities and is GPLed?
Thanks for your help,
Ryan McCue
Take a look here:
http://hp.jpsband.org/
Cool, thanks!
I was trying out kses, but this looks so much better.
However, it doesn't convert, say, &amp;agrave; to &amp;#224; but instead converts it back to an à. It also deletes numbered entities like & #0224;
Any ideas?
Last edited by rmccue on Thu Oct 05, 2006 6:54 pm, edited 1 time in total.
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Yeah, that's the correct behavior. When you output stuff in UTF-8, you want to use as few entities as possible, so HTML Purifier converts each entity to the actual character, which is perfectly fine and actually recommended.
It seems like you just want the named entity to be converted into a numeric entity. I could implement something like that, although I won't be able to preserve all the original entities: this is because entities can mask XSS, so they must be parsed.
What numbered entity were you referring to?
I was talking about & #0224;
Also, having named entities converted to numbered would be good, because most named entities must be numbered in XML.
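For illustration, the conversion being asked for here (named entity to numbered entity) could be sketched roughly as follows. This is not code from HTML Purifier, SafeHTML or kses: the function names `utf8CodePoint` and `namedToNumericEntities` are hypothetical, and the sketch assumes UTF-8 and only the HTML 4.01 entity set known to PHP's built-in `html_entity_decode`.

```php
<?php
// Hypothetical helpers for illustration only; not part of any library in this thread.

/** Returns the code point of one valid UTF-8 encoded character (1 to 4 bytes). */
function utf8CodePoint(string $char): int
{
    $b = array_values(unpack('C*', $char));
    switch (count($b)) {
        case 1:
            return $b[0];
        case 2:
            return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:
            return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default:
            return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                 | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

/** Rewrites HTML 4.01 named entities as decimal references, keeping XML's five. */
function namedToNumericEntities(string $html): string
{
    $xmlPredefined = ['amp', 'lt', 'gt', 'quot', 'apos'];

    $result = preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($xmlPredefined): string {
            if (in_array($m[1], $xmlPredefined, true)) {
                return $m[0]; // already legal in XML, leave untouched
            }
            $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
            if ($char === $m[0]) {
                return $m[0]; // name not recognised, leave untouched
            }
            return '&#' . utf8CodePoint($char) . ';';
        },
        $html
    );

    return $result ?? $html; // preg_replace_callback returns null on a regex error
}

echo namedToNumericEntities('caf&eacute; &agrave; la carte &amp; &lt;b&gt;'), "\n";
// prints: caf&#233; &#224; la carte &amp; &lt;b&gt;
```

Note that this only rewrites entity names. It does no filtering at all, so it is no substitute for a real sanitizer.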
Last edited by rmccue on Thu Oct 05, 2006 6:57 pm, edited 1 time in total.
- Ambush Commander
Hmm, & #0224; seems to work for me. Lower-case a with an accent grave, correct?
Correct. XML only defines five named entities. If you look carefully, however, you'll notice that HTML Purifier's behavior is thus:
Named entity -> Character
Numeric entity -> Character
Character stays a Character
HTML Purifier will never output an entity unless it's &amp;lt;, &amp;gt;, &amp;amp; or &amp;quot;.
I'm not sure what you mean by "having named entities": you're not allowed to have them in XML except the four + apostrophe I mentioned above.
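The behaviour described above can be imitated for plain text (not markup) with PHP's built-in functions. This is only a sketch of the idea, not HTML Purifier's actual code; `normalizeEntities` is a hypothetical name and UTF-8 is assumed throughout.

```php
<?php
// Hypothetical helper for illustration; operates on a text node, not on markup.
function normalizeEntities(string $text): string
{
    // Named entity -> character, numeric entity -> character,
    // and a literal character simply stays as it is.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // Re-escape only the characters that are special in markup: < > & "
    return htmlspecialchars($decoded, ENT_COMPAT | ENT_HTML401, 'UTF-8');
}

echo normalizeEntities('&agrave; &#224; à &lt;b&gt; &amp; &quot;'), "\n";
// prints: à à à &lt;b&gt; &amp; &quot;
```

All three spellings of the accented letter end up as the same literal character, and only the markup-significant characters come back out as entities.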
- Ambush Commander
Ha, yes, that could be a problem. Note that HTML Purifier is smart enough to strip out control characters from input. A lot of people often forget to do that.
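As a rough idea of what stripping control characters involves, here is a minimal sketch. It is an assumption about the general technique, not HTML Purifier's implementation, and `stripControlCharacters` is a hypothetical name. It removes the same ASCII range that the function posted in this thread removes: everything up to 31 plus 127, except tab, line feed and carriage return.

```php
<?php
// Hypothetical helper for illustration only.
function stripControlCharacters(string $text): string
{
    // Drop C0 control characters and DEL (0x7F), but keep
    // tab (0x09), line feed (0x0A) and carriage return (0x0D).
    // Bytes of multi-byte UTF-8 characters are all >= 0x80, so they are untouched.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $text) ?? $text;
}

var_dump(stripControlCharacters("a\x00b\x07c\td\r\ne\x7F") === "abc\td\r\ne");
// prints: bool(true)
```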
Yes, but not as the code stands. While there really is no reason why they should be using anything besides UTF-8 (iconv, anyone?), here's what you can do.
Do the output as usual. Then, run the function below on it. This is dumb escaping: EVERYTHING non-ASCII gets escaped. I'll bundle this functionality with the main library in a later release.
Code:
// adapted from utf8ToUnicode by Henri Sivonen and
// hsivonen@iki.fi at <http://iki.fi/hsivonen/php-utf8/>
function escapeNonASCIICharacters($str) {
    $mState = 0; // cached expected number of octets after the current octet
                 // until the beginning of the next UTF8 character sequence
    $mUcs4  = 0; // cached Unicode character
    $mBytes = 1; // cached expected number of octets in the current sequence

    // original code involved an $out that was an array of Unicode
    // codepoints. Instead of having to convert back into UTF-8, we've
    // decided to directly append valid UTF-8 characters onto a string
    // $out once they're done. $char accumulates raw bytes, while $mUcs4
    // turns into the Unicode code point, so there's some redundancy.
    $out = '';
    $char = '';

    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        // square brackets: the $str{$i} offset syntax was removed in PHP 8
        $in = ord($str[$i]);
        $char .= $str[$i]; // append byte to char
        if (0 == $mState) {
            // When mState is zero we expect either a US-ASCII character
            // or a multi-octet sequence.
            if (0 == (0x80 & ($in))) {
                // US-ASCII, pass straight through.
                if (($in <= 31 || $in == 127) &&
                    !($in == 9 || $in == 13 || $in == 10) // save \r\t\n
                ) {
                    // control characters, remove
                } else {
                    $out .= $char;
                }
                // reset
                $char = '';
                $mBytes = 1;
            } elseif (0xC0 == (0xE0 & ($in))) {
                // First octet of 2 octet sequence
                $mUcs4 = ($in);
                $mUcs4 = ($mUcs4 & 0x1F) << 6;
                $mState = 1;
                $mBytes = 2;
            } elseif (0xE0 == (0xF0 & ($in))) {
                // First octet of 3 octet sequence
                $mUcs4 = ($in);
                $mUcs4 = ($mUcs4 & 0x0F) << 12;
                $mState = 2;
                $mBytes = 3;
            } elseif (0xF0 == (0xF8 & ($in))) {
                // First octet of 4 octet sequence
                $mUcs4 = ($in);
                $mUcs4 = ($mUcs4 & 0x07) << 18;
                $mState = 3;
                $mBytes = 4;
            } elseif (0xF8 == (0xFC & ($in))) {
                // First octet of 5 octet sequence.
                //
                // This is illegal because the encoded codepoint must be
                // either:
                // (a) not the shortest form or
                // (b) outside the Unicode range of 0-0x10FFFF.
                // Rather than trying to resynchronize, we will carry on
                // until the end of the sequence and let the later error
                // handling code catch it.
                $mUcs4 = ($in);
                $mUcs4 = ($mUcs4 & 0x03) << 24;
                $mState = 4;
                $mBytes = 5;
            } elseif (0xFC == (0xFE & ($in))) {
                // First octet of 6 octet sequence, see comments for 5
                // octet sequence.
                $mUcs4 = ($in);
                $mUcs4 = ($mUcs4 & 1) << 30;
                $mState = 5;
                $mBytes = 6;
            } else {
                // Current octet is neither in the US-ASCII range nor a
                // legal first octet of a multi-octet sequence.
                $mState = 0;
                $mUcs4 = 0;
                $mBytes = 1;
                $char = '';
            }
        } else {
            // When mState is non-zero, we expect a continuation of the
            // multi-octet sequence
            if (0x80 == (0xC0 & ($in))) {
                // Legal continuation.
                $shift = ($mState - 1) * 6;
                $tmp = $in;
                $tmp = ($tmp & 0x0000003F) << $shift;
                $mUcs4 |= $tmp;

                if (0 == --$mState) {
                    // End of the multi-octet sequence. mUcs4 now contains
                    // the final Unicode codepoint to be output

                    // Check for illegal sequences and codepoints.

                    // From Unicode 3.1, non-shortest form is illegal
                    if (((2 == $mBytes) && ($mUcs4 < 0x0080)) ||
                        ((3 == $mBytes) && ($mUcs4 < 0x0800)) ||
                        ((4 == $mBytes) && ($mUcs4 < 0x10000)) ||
                        (4 < $mBytes) ||
                        // From Unicode 3.2, surrogate characters = illegal
                        (($mUcs4 & 0xFFFFF800) == 0xD800) ||
                        // Codepoints outside the Unicode range are illegal
                        ($mUcs4 > 0x10FFFF)
                    ) {
                        // illegal sequence or codepoint: silently dropped
                    } elseif (0xFEFF != $mUcs4 && // omit BOM
                        !($mUcs4 >= 128 && $mUcs4 <= 159) // omit non-SGML
                    ) {
                        $out .= '&#' . $mUcs4 . ';';
                    }
                    // initialize UTF8 cache (reset)
                    $mState = 0;
                    $mUcs4 = 0;
                    $mBytes = 1;
                    $char = '';
                }
            } else {
                // ((0xC0 & (*in) != 0x80) && (mState != 0))
                // Incomplete multi-octet sequence.
                // used to result in complete fail, but we'll reset
                $mState = 0;
                $mUcs4 = 0;
                $mBytes = 1;
                $char = '';
            }
        }
    }
    return $out;
}
- Ambush Commander
Here it is: viewtopic.php?p=316735
Should I make a new topic for that or leave it on this one?
You may bump it since it's about 2 days old.