HTML entities, the character ones (not decimal/hex)

Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

I don't normally post on Code but this isn't deep enough to warrant a Theory post. I just want to confirm my hunch.

I've built an SGML lexer that parses SGML/XML documents in a very forgiving manner. (It's sort of like PEAR's XML_HTMLSax3, except it's faster. They're both compatible actually. Almost.) The raw text it funnels into tokens should be parsed, that is, ALL entities expanded into their UTF-8 forms (we're using UTF-8 internally; maybe UTF-16 would have been easier, but it didn't make sense to have to convert the whole thing into a format that isn't ASCII-compatible).

However, I'm having trouble implementing it with html_entity_decode(). It's a bit of a mystery meat sausage function, this html_entity_decode(). It only properly converts &mdash; into its UTF-8 form when the proper character encoding is passed, but, of course, the function doesn't work properly in PHP 4 with UTF-8 (compatibility with PHP 4 is a major goal; otherwise, I would have piggy-backed off a lot of the stuff in PHP 5).

Most of the workarounds only deal with decimal and hex entities, but named entities like &mdash; aren't handled. I'm wondering if there's a way to fix this in PHP 4 WITHOUT external libraries.

I don't think there is.

(gets ready to create a lookup array of ~250 character entities for the conversion)
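
For anyone following along, here's a rough sketch of what that lookup-array approach could look like. The function names (codepoint_to_utf8, entity_callback, decode_entities_utf8) are made up for illustration, and the table holds only a handful of entries; a real one would carry the full HTML 4 entity list. It sticks to chr() and preg_replace_callback(), so it shouldn't need mbstring or iconv, but I haven't run it on an actual PHP 4 install.

```php
<?php
// Hypothetical helpers; names and the tiny entity table are illustrative only.

// Encode a Unicode code point as a UTF-8 byte string using nothing but chr().
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Callback for one matched entity; $m[1] is the text between '&' and ';'.
function entity_callback($m)
{
    // Assumption: a small subset of the named entities, mapped to code points.
    static $table = array(
        'amp' => 38, 'lt' => 60, 'gt' => 62, 'quot' => 34,
        'nbsp' => 160, 'copy' => 169, 'eacute' => 233,
        'ndash' => 8211, 'mdash' => 8212, 'hellip' => 8230, 'euro' => 8364,
    );
    $body = $m[1];
    if ($body[0] == '#') {
        // Numeric reference: hex (&#x2014;) or decimal (&#8212;).
        if ($body[1] == 'x' || $body[1] == 'X') {
            $cp = hexdec(substr($body, 2));
        } else {
            $cp = (int) substr($body, 1);
        }
        // Leave out-of-range references untouched.
        if ($cp > 0x10FFFF) {
            return $m[0];
        }
        return codepoint_to_utf8($cp);
    }
    if (isset($table[$body])) {
        return codepoint_to_utf8($table[$body]);
    }
    // Unknown name: pass the original text through unchanged.
    return $m[0];
}

// Expand named, decimal, and hex entities in $text into UTF-8 bytes.
function decode_entities_utf8($text)
{
    return preg_replace_callback(
        '/&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);/',
        'entity_callback',
        $text
    );
}
```

Calling decode_entities_utf8('a &mdash; b') should give back the string with the three bytes E2 80 94 where the entity was, and anything not in the table (say &bogus;) is left alone rather than dropped. A single pass also means &amp;lt; comes out as the literal text &lt; instead of being decoded twice.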
Jenk
DevNet Master
Posts: 3587
Joined: Mon Sep 19, 2005 6:24 am
Location: London

Post by Jenk »

Perhaps utf8_decode > html_entity_decode > utf8_encode ?
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

Nope. utf8_decode converts to ISO-8859-1, so every character outside that range (which includes most multibyte ones) gets discarded and a '?' is put in its place.