Parsing UTF-8 files with SimpleXML
Posted: Thu Nov 27, 2008 6:05 am
by rmccue
I'm unable to parse http://yadan.net/rss/opml.php?act=export with SimpleXML. It fails on line 82 with the following error (presumably directly from libxml):
Input is not proper UTF-8, indicate encoding ! Bytes: 0xD7 0x22 0x20 0x74
However, a UTF-8 encoding is specified in the document's XML declaration. Is there any way to force libxml/SimpleXML to parse it as UTF-8?
Re: Parsing UTF-8 files with SimpleXML
Posted: Thu Nov 27, 2008 2:11 pm
by dml
The utf8 in that feed is broken. The problem line is the one with htmlUrl="http://linmagazine.co.il". They've truncated the description field at 255 bytes, leaving half a utf8-encoded character (that's the 0xD7 in the error message) hanging off the end.
If you want to be generous and make a best effort to parse the bad feed, you can probably run it through a filter that skips past malformed utf8 characters. I don't know offhand of a library for doing it.
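A minimal sketch of such a filter in PHP, assuming the mbstring extension is available. The helper name strip_invalid_utf8 and the sample document are made up for illustration; the sample only imitates the truncated attribute (a lone 0xD7 lead byte followed by the closing quote) and is not the real feed. Converting from UTF-8 to UTF-8 with the substitute character set to 'none' should drop bytes that do not form a valid sequence:

```php
<?php
// Hypothetical helper: drop byte sequences that are not valid UTF-8.
function strip_invalid_utf8($bytes)
{
    $previous = mb_substitute_character();
    mb_substitute_character('none'); // drop invalid bytes instead of emitting '?'
    $clean = mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous); // restore the global setting
    return $clean;
}

// Illustrative input: 0xD7 is the lead byte of a two-byte character,
// but the next byte is the attribute's closing quote (0x22).
$broken = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        . "<opml><body><outline description=\"abc\xD7\" title=\"t\"/></body></opml>";

$xml = simplexml_load_string(strip_invalid_utf8($broken));
```

This silently discards the damaged character, so the description loses its last letter; that seems acceptable for a best-effort parse but is a design choice, not a repair.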
Re: Parsing UTF-8 files with SimpleXML
Posted: Fri Nov 28, 2008 3:37 am
by rmccue
dml wrote:The utf8 in that feed is broken. The problem line is the one with htmlUrl="http://linmagazine.co.il". They've truncated the description field at 255 bytes, leaving half a utf8-encoded character (that's the 0xD7 in the error message) hanging off the end.
Ah, thanks, that explains that.
dml wrote:If you want to be generous and make a best effort to parse the bad feed, you can probably run it through a filter that skips past malformed utf8 characters. I don't know offhand of a library for doing it.
Would something like MediaWiki's UtfNormal class work?
Re: Parsing UTF-8 files with SimpleXML
Posted: Fri Nov 28, 2008 3:02 pm
by dml
That mediawiki library looks like it does what's needed. Thanks for mentioning that, I've been looking for a library to do Unicode cleanup/normalisation stuff.
Re: Parsing UTF-8 files with SimpleXML
Posted: Sat Nov 29, 2008 10:21 pm
by rmccue
dml wrote:That mediawiki library looks like it does what's needed. Thanks for mentioning that, I've been looking for a library to do Unicode cleanup/normalisation stuff.
Great, thanks. Now just to decide whether the benefit of using it outweighs the cost (timewise).
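One way to keep the cost down might be to clean only the feeds that actually need it. The sketch below is an assumption, not something tested against the feed in question: parse_lenient is a hypothetical name, and it uses mbstring's mb_check_encoding as a validity check and mb_convert_encoding as a stand-in for a fuller cleanup library, so well-formed feeds skip the cleanup step entirely.

```php
<?php
// Hypothetical helper: parse XML, cleaning the bytes only if they are not valid UTF-8.
function parse_lenient($bytes)
{
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        $previous = mb_substitute_character();
        mb_substitute_character('none'); // drop invalid bytes
        $bytes = mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');
        mb_substitute_character($previous);
    }

    // Collect libxml errors internally instead of emitting warnings.
    $previousErrors = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($bytes);
    libxml_use_internal_errors($previousErrors);

    return $xml; // SimpleXMLElement on success, false on failure
}
```

Whether the validity check is cheap enough in practice would need measuring on real feeds.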