Parsing UTF-8 files with SimpleXML

rmccue
Forum Commoner
Posts: 27
Joined: Thu Oct 05, 2006 12:47 am
Location: Gold Coast, Australia

Post by rmccue »

I'm unable to parse http://yadan.net/rss/opml.php?act=export with SimpleXML. The failure is reported on line 82, and I get the following error (presumably straight from libxml):
Input is not proper UTF-8, indicate encoding ! Bytes: 0xD7 0x22 0x20 0x74
However, a UTF-8 encoding is specified in the document's XML declaration. Is there any way to force libxml/SimpleXML to parse it as UTF-8?
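
For anyone reproducing this, here is a minimal sketch of collecting libxml's messages rather than letting them surface as PHP warnings. The inline OPML string is a made-up stand-in for the feed (an assumption, not the real document), with an attribute that ends on a lone 0xD7 byte:

```php
<?php
// Hypothetical stand-in for the feed: the attribute value ends with 0xD7,
// the first byte of a two-byte UTF-8 sequence whose second byte is missing.
$opml = '<?xml version="1.0" encoding="UTF-8"?>'
      . '<opml><body><outline description="abc ' . "\xD7" . '" text="x"/></body></opml>';

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$xml = simplexml_load_string($opml);
$errors = libxml_get_errors();
libxml_clear_errors();

if ($xml === false) {
    foreach ($errors as $error) {
        // Each LibXMLError carries the message plus the line and column.
        printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
    }
}
```

The line and column reported this way should point at the offending attribute, which makes it easier to see what is wrong with the bytes there.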
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Post by dml »

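The UTF-8 in that feed is broken. The problem line is the one with htmlUrl="http://linmagazine.co.il". They've truncated the description field at 255 bytes, leaving half a UTF-8-encoded character hanging off the end. That's the 0xD7 in the error message: it's the lead byte of a two-byte sequence, and the 0x22 that follows it is the attribute's closing quote rather than a continuation byte.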

If you want to be generous and make a best effort to parse the bad feed, you can probably run it through a filter that skips malformed UTF-8 sequences. I don't know of a library for that offhand.
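
A minimal sketch of such a filter, assuming a plain byte-level regex is acceptable. The helper name strip_invalid_utf8() is hypothetical, not a library function. It keeps every well-formed UTF-8 sequence and drops any byte that is not part of one:

```php
<?php
/**
 * Drop every byte that is not part of a well-formed UTF-8 sequence.
 * strip_invalid_utf8() is a hypothetical helper name, not a library function.
 */
function strip_invalid_utf8(string $bytes): string
{
    // Either capture one well-formed sequence, or match a single stray byte.
    // No /u flag, so "." (with /s) matches exactly one byte.
    $pattern = '/(
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequence
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
    ) | . /xs';

    // Stray bytes leave the capture group unset, so they are replaced by nothing.
    $clean = preg_replace($pattern, '$1', $bytes);
    if ($clean === null) {
        throw new RuntimeException('preg_replace failed: ' . preg_last_error());
    }
    return $clean;
}

// A complete two-byte character (0xD7 0x9C) followed by a truncated one (0xD7).
$bad   = 'description="abc ' . "\xD7\x9C\xD7" . '" text="x"';
$clean = strip_invalid_utf8($bad);
```

Here the complete character survives and only the dangling 0xD7 is removed, so the result can be passed on to simplexml_load_string(). This only repairs byte-level well-formedness: the description stays truncated, and characters that XML 1.0 forbids for other reasons are not touched.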

Post by rmccue »

dml wrote: The UTF-8 in that feed is broken. The problem line is the one with htmlUrl="http://linmagazine.co.il". They've truncated the description field at 255 bytes, leaving half a UTF-8-encoded character (that's the 0xD7 in the error message) hanging off the end.
Ah, thanks, that explains that.
dml wrote: If you want to be generous and make a best effort to parse the bad feed, you can probably run it through a filter that skips past malformed UTF-8 characters. I don't know offhand of a library for doing it.
Would something like MediaWiki's UtfNormal class work?

Post by dml »

That MediaWiki library looks like it does what's needed. Thanks for mentioning it; I've been looking for a library that does Unicode cleanup and normalisation.

Post by rmccue »

dml wrote: That MediaWiki library looks like it does what's needed. Thanks for mentioning that, I've been looking for a library to do Unicode cleanup/normalisation stuff.
Great, thanks. Now I just need to decide whether the benefit of using it outweighs the cost in processing time.