HTML entities, the character ones (not decimal/hex)

Posted: Wed Jul 19, 2006 9:35 pm
by Ambush Commander
I don't normally post on Code but this isn't deep enough to warrant a Theory post. I just want to confirm my hunch.

I've built an SGML lexer that parses SGML/XML documents in a very forgiving manner. (It's sort of like PEAR's XML_HTMLSax3, except it's faster. The two are actually compatible. Almost.) The raw text it funnels into tokens should be parsed, that is, ALL entities expanded into their UTF-8 forms (we're using UTF-8 internally; maybe UTF-16 would have been easier, but it didn't make sense to convert the whole thing into a format that isn't ASCII-compatible).

However, I'm having trouble implementing it with html_entity_decode(). It's a bit of a mystery meat sausage function, this html_entity_decode(). It only converts &mdash; into its UTF-8 form when the proper character encoding is passed, but, of course, the function doesn't work properly with UTF-8 in PHP 4 (compatibility with PHP 4 is a major goal; otherwise, I would have piggy-backed off a lot of the stuff in PHP 5).

Most of the workarounds only deal with decimal and hex entities, so named entities like &mdash; aren't handled. I'm wondering if there's a way to fix this in PHP 4 WITHOUT external libraries.

I don't think there is.

(gets ready to create a lookup array of ~250 character entities for the conversion)
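
For what it's worth, the lookup-array idea could be sketched roughly like this. It is only a sketch under assumptions: the sketch_* function names are made up for illustration, the table lists a handful of the ~250 named entities, and code points are only checked against an upper bound. It sticks to functions that exist in PHP 4 (preg_replace_callback, chr, hexdec), but it has not been checked against every PHP 4 release.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function sketch_codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical, deliberately tiny table: entity name => code point.
// A real table would list every named entity the lexer should accept.
function sketch_entity_table()
{
    return array(
        'amp'    => 0x26,
        'lt'     => 0x3C,
        'gt'     => 0x3E,
        'quot'   => 0x22,
        'nbsp'   => 0xA0,
        'eacute' => 0xE9,
        'mdash'  => 0x2014,
        'hellip' => 0x2026,
        'euro'   => 0x20AC,
    );
}

// Callback for preg_replace_callback(): $matches[1] is the entity body
// without the leading '&' and trailing ';'.
function sketch_entity_callback($matches)
{
    $body = $matches[1];
    if ($body[0] === '#') {
        // Numeric entity: "#x..." is hexadecimal, "#..." is decimal.
        if ($body[1] === 'x' || $body[1] === 'X') {
            $cp = hexdec(substr($body, 2));
        } else {
            $cp = (int) substr($body, 1);
        }
        // Out of Unicode range: leave the entity as written.
        if ($cp > 0x10FFFF) {
            return $matches[0];
        }
        return sketch_codepoint_to_utf8($cp);
    }
    $table = sketch_entity_table();
    if (isset($table[$body])) {
        return sketch_codepoint_to_utf8($table[$body]);
    }
    // Unknown name: leave the entity as written.
    return $matches[0];
}

// Expand named, decimal and hex entities in one pass.
function sketch_expand_entities($text)
{
    return preg_replace_callback(
        '/&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);/',
        'sketch_entity_callback',
        $text
    );
}

echo bin2hex(sketch_expand_entities('&mdash;')), "\n"; // e28094
```

Because everything is expanded in a single pass, already-escaped text such as &amp;lt; comes out as &lt; rather than being decoded twice.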

Posted: Thu Jul 20, 2006 6:47 am
by Jenk
Perhaps utf8_decode > html_entity_decode > utf8_encode?

Posted: Thu Jul 20, 2006 6:53 am
by Ambush Commander
Nope. utf8_decode converts to ISO-8859-1, so every character without an ISO-8859-1 equivalent gets discarded and a '?' put in its place.
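
A quick way to see the data loss, as a minimal sketch: sketch_latin1_hex is a made-up helper name, and the @ is there only because utf8_decode() is deprecated as of PHP 8.2 and would otherwise emit a notice.

```php
<?php
// Hypothetical helper: run utf8_decode() and show the resulting bytes as hex.
function sketch_latin1_hex($utf8)
{
    return bin2hex(@utf8_decode($utf8));
}

// U+2014 (em dash) has no ISO-8859-1 equivalent, so it becomes '?'.
echo sketch_latin1_hex("\xE2\x80\x94"), "\n"; // 3f

// U+00E9 (e with acute accent) does exist in ISO-8859-1, so it survives.
echo sketch_latin1_hex("\xC3\xA9"), "\n"; // e9
```

So the decode/encode round trip would only be safe for text that already fits in ISO-8859-1.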