Parsing UTF-8 files with SimpleXML
I'm unable to parse http://yadan.net/rss/opml.php?act=export (on line 82) with SimpleXML, as I instead get the following error (presumably directly from libxml):
Input is not proper UTF-8, indicate encoding ! Bytes: 0xD7 0x22 0x20 0x74
However, a UTF-8 encoding is specified in the document's XML declaration. Is there any way to force libxml/SimpleXML to parse it as UTF-8?
Re: Parsing UTF-8 files with SimpleXML
The utf8 in that feed is broken. The problem line is the one with htmlUrl="http://linmagazine.co.il". They've truncated the description field at 255 bytes, leaving half a utf8-encoded character (that's the 0xD7 in the error message) hanging off the end.
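To illustrate the diagnosis, here is a minimal sketch of how a byte-based cut at 255 bytes can leave a lone lead byte behind. The description text is made up for the example (the real field content is not shown in the thread), and the code assumes PHP 7 or later with the mbstring extension loaded.

```php
<?php
// Hypothetical description: 128 copies of the Hebrew letter alef (U+05D0).
// Each one is two bytes in UTF-8 (0xD7 0x90), so the string is 256 bytes long.
$description = str_repeat("\u{05D0}", 128);

// Byte-based truncation, as the feed appears to do: the cut lands
// between the two bytes of the last character.
$truncated = substr($description, 0, 255);

var_dump(mb_check_encoding($description, 'UTF-8')); // the full string is valid UTF-8
var_dump(mb_check_encoding($truncated, 'UTF-8'));   // the truncated string is not
echo bin2hex(substr($truncated, -1)), "\n";         // the orphaned lead byte

// A character-aware cut backs up to the previous character boundary instead.
$safe = mb_strcut($description, 0, 255, 'UTF-8');
var_dump(mb_check_encoding($safe, 'UTF-8'));
```

In the feed, the byte after the orphaned 0xD7 is the closing quote of the attribute (0x22), which matches the byte sequence in the error message.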
If you want to be generous and make a best effort to parse the bad feed, you can probably run it through a filter that skips past malformed utf8 characters. I don't know offhand of a library for doing it.
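One way such a filter might look is sketched below. It is not from the thread: `strip_invalid_utf8()` is a hypothetical helper name, the sample OPML string is invented, and the approach assumes the mbstring extension is available. It relies on `mb_convert_encoding()` converting from UTF-8 to UTF-8 with the substitute character set to `none`, so that bytes which do not form a valid sequence are dropped rather than replaced with `?`.

```php
<?php
// Hypothetical helper: drop bytes that are not part of a valid UTF-8 sequence.
function strip_invalid_utf8(string $bytes): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character('none');         // drop invalid sequences instead of emitting '?'
    $clean = mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the setting for other callers
    return $clean;
}

// Invented sample: an attribute value ending in a lone 0xD7 lead byte,
// mimicking the truncated description in the feed.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<opml><outline description=\"abc\xD7\" text=\"x\"/></opml>";

$xml = simplexml_load_string(strip_invalid_utf8($raw));
if ($xml === false) {
    echo "still not parseable\n";
} else {
    echo $xml->outline['description'], "\n";
}
```

Note that this silently discards the damaged character, so the cleaned text is a best effort rather than a faithful copy of what the feed author wrote.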
Re: Parsing UTF-8 files with SimpleXML
dml wrote: The utf8 in that feed is broken. The problem line is the one with htmlUrl="http://linmagazine.co.il". They've truncated the description field at 255 bytes, leaving half a utf8-encoded character (that's the 0xD7 in the error message) hanging off the end.
Ah, thanks, that explains that.
dml wrote: If you want to be generous and make a best effort to parse the bad feed, you can probably run it through a filter that skips past malformed utf8 characters. I don't know offhand of a library for doing it.
Would something like MediaWiki's UtfNormal class work?
Re: Parsing UTF-8 files with SimpleXML
That MediaWiki library looks like it does what's needed. Thanks for mentioning that, I've been looking for a library to do Unicode cleanup/normalisation stuff.
Re: Parsing UTF-8 files with SimpleXML
dml wrote: That mediawiki library looks like it does what's needed. Thanks for mentioning that, I've been looking for a library to do Unicode cleanup/normalisation stuff.
Great, thanks. Now just to decide whether the benefit of using it outweighs the cost (timewise).