Character sets with SimpleXML

swan52
Forum Newbie
Posts: 14
Joined: Fri Jul 09, 2010 1:58 pm

Character sets with SimpleXML

Post by swan52 »

Hi, I have been using SimpleXML to copy data from XML files into a database for some time now. However, I recently discovered to my dismay that certain letters are not being copied correctly. For example, European cities such as Munich (München) display as München.
Is there a way to specify the correct character set, or is this a problem with SimpleXML?
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Re: Character sets with SimpleXML

Post by Apollo »

What kind of encoding does the header in your .xml file specify, and is the actual content indeed encoded using that particular encoding?

And is that same encoding also used in your database, and in the HTML you output when retrieving + displaying contents from the database later?
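
One rough way to check the first question is to compare the encoding named in the XML declaration with the raw bytes. This is only a sketch: the helper names declaredEncoding and looksLikeUtf8 are made up for illustration, the check assumes the mbstring extension is loaded, and it is a heuristic (some ISO-8859-1 byte runs happen to be valid UTF-8 as well).

```php
<?php

// Hypothetical helper: pull the encoding name out of the XML declaration, if any.
function declaredEncoding(string $xml): ?string
{
    if (preg_match('/^<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']/', $xml, $m)) {
        return strtoupper($m[1]);
    }
    return null; // no declaration: XML parsers then assume UTF-8 (or UTF-16 with a BOM)
}

// Hypothetical helper: true when the bytes contain at least one non-ASCII
// byte and are well-formed UTF-8.
function looksLikeUtf8(string $bytes): bool
{
    return preg_match('/[\x80-\xFF]/', $bytes) === 1
        && mb_check_encoding($bytes, 'UTF-8');
}

// Two sample documents, written with byte escapes so the result does not
// depend on how this .php file itself is saved.
$latin1Doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><ctr>M\xFCnchen</ctr>";
$mixedUp   = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><ctr>M\xC3\xBCnchen</ctr>";

foreach (['latin1Doc' => $latin1Doc, 'mixedUp' => $mixedUp] as $name => $doc) {
    $declared = declaredEncoding($doc);
    // Flag the case where the header says ISO-8859-1 but the bytes look like UTF-8.
    $mismatch = $declared === 'ISO-8859-1' && looksLikeUtf8($doc);
    echo $name, ': declared ', $declared,
        $mismatch ? ' but bytes look like UTF-8' : ', bytes are consistent', "\n";
}
```

If the header and the bytes disagree, the file itself needs fixing before looking at the database or the HTML output.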
swan52
Forum Newbie
Posts: 14
Joined: Fri Jul 09, 2010 1:58 pm

Re: Character sets with SimpleXML

Post by swan52 »

The .xml encoding is ISO 8859-1.
When I simply echo the results from the XML, the characters already appear 'corrupted', so I do not believe it is the database. I can also copy and paste the records manually into the db without any issues.
StathisG
Forum Newbie
Posts: 14
Joined: Sat Mar 13, 2010 7:15 pm
Location: UK

Re: Character sets with SimpleXML

Post by StathisG »

swan52 wrote:When I simply echo the results from the xml the characters already appear 'corrupted'
Try echoing the results in a page that uses the UTF-8 charset.
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Re: Character sets with SimpleXML

Post by Apollo »

swan52 wrote:The .xml encoding is ISO 8859-1.
What do you mean: does the .xml header specify iso-8859-1 encoding, and/or is the actual content itself iso-8859-1 encoded? (ideally, both would be true, but there is a vast majority of f*ckups who completely ignore the whole encoding concept or have never even heard of it, and just assume everything magically works out by itself...)
swan52 wrote:When I simply echo the results from the xml the characters already appear 'corrupted' so I do not believe it is the database and I can copy paste the records manually into db without any issues.
What happens if you read the xml content into a string $s and then echo bin2hex($s) instead of echo'ing $s directly? (especially in case of a string with funny characters, such as "München").

What encoding does your HTML page use? Also iso-8859-1 ?
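
To make the bin2hex test easier to read, here is a small sketch of what the two candidate encodings look like for the same word. The strings use byte escapes rather than typed characters, so the output does not depend on how the editor saves the file. It assumes the mbstring extension is available for mb_convert_encoding.

```php
<?php

// "München" as UTF-8: the ü takes two bytes, C3 BC.
$utf8 = "M\xC3\xBCnchen";

// The same word converted to ISO-8859-1: the ü becomes the single byte FC.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";   // 4dc3bc6e6368656e
echo bin2hex($latin1), "\n"; // 4dfc6e6368656e

// The byte length differs even though both hold seven characters.
echo strlen($utf8), ' vs ', strlen($latin1), "\n"; // 8 vs 7
```

So an fc in the hex dump points to iso-8859-1 (or windows-1252), and c3bc points to utf-8.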
swan52
Forum Newbie
Posts: 14
Joined: Fri Jul 09, 2010 1:58 pm

Re: Character sets with SimpleXML

Post by swan52 »

Apollo wrote:What do you mean: does the .xml header specify iso-8859-1 encoding, and/or is the actual content itself iso-8859-1 encoded? (ideally, both would be true, but there is a vast majority of f*ckups who completely ignore the whole encoding concept or have never even heard of it, and just assume everything magically works out by itself...)
Yes, the XML header defines the encoding in this way:

<?xml version="1.0" encoding="ISO-8859-1"?>

Apollo wrote:What happens if you read the xml content into a string $s and then echo bin2hex($s) instead of echo'ing $s directly? (especially in case of a string with funny characters, such as "München").

What encoding does your HTML page use? Also iso-8859-1 ?
If I echo using bin2hex, it gives the hex code of the word: 4dfc6e6368656e. If I directly type echo "München"; in PHP, it echoes München. As I was merely using SimpleXML to copy the data, there is no page layout, because one wasn't needed. I believe the issue is with SimpleXML; however, I am not aware of a way to set character sets with it, hence my predicament.
Last edited by swan52 on Wed Jul 21, 2010 7:29 am, edited 2 times in total.
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Re: Character sets with SimpleXML

Post by Apollo »

swan52 wrote:Yes, the xml header defines the encoding in this way-

<?xml version="1.0" encoding="ISO-8859-1"?>

if I echo using bin2hex it gives the hex code of the word in numbers and letters - 4dfc6e6368656e
Ok, that is indeed the iso-8859-1 encoding for 'München', so at least the .xml (and the data that SimpleXML reads from it) is correct.
swan52 wrote:if I directly type echo "München"; in PHP it echos... München

Well, there is no such thing as 'directly typing echo "München"; in PHP', because .php files are binary (as in, not officially conforming to any specific encoding). So if you put something in there that looks like 'München' in your editor, php will simply output whichever raw bytes your editor decides to store for that string (most likely utf-8 unicode or windows-1252 ansi).

Try echo'ing this html header first:

Code: Select all

<html><head><meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1' /></head><body>
And then echo the data you read with SimpleXML. Does that help? (if not, can you echo bin2hex($result_from_SimpleXML) again?)
swan52
Forum Newbie
Posts: 14
Joined: Fri Jul 09, 2010 1:58 pm

Re: Character sets with SimpleXML

Post by swan52 »

Hmm, I'm not entirely sure what you mean. Broken down to its simplest form, I have an XML file like so:

Code: Select all

<?xml version="1.0" encoding="ISO-8859-1"?>
<file>
  <Placemark>
    <ctr>München</ctr>
    <ctdtl>Rodríguez</ctdtl>
  </Placemark>
</file>
and a PHP script that tries to read the file (including the bin2hex you mentioned):

Code: Select all

<?php

// Load the test file shown above.
$xml = simplexml_load_file("test.xml");

// Print each value twice: first as hex bytes, then as text.
echo bin2hex((string) $xml->Placemark[0]->ctr[0])."<br>";
echo $xml->Placemark[0]->ctr[0]."<br>";
echo bin2hex((string) $xml->Placemark[0]->ctdtl[0])."<br>";
echo $xml->Placemark[0]->ctdtl[0]."<br>";

?>
Running this, the results differ from the original file. Changing the HTML layout doesn't appear to make any difference, because the result is already corrupted. Is there something in PHP or SimpleXML that I am missing?
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Re: Character sets with SimpleXML

Post by Apollo »

swan52 wrote:Running this, the results differ from the original file.
What results do you get from this php?
swan52
Forum Newbie
Posts: 14
Joined: Fri Jul 09, 2010 1:58 pm

Re: Character sets with SimpleXML

Post by swan52 »

Apollo wrote:
swan52 wrote:Running this, the results differ from the original file.
What results do you get from this php?
204dc3bc6e6368656e
München
526f6472c3ad6775657a
Rodríguez
swan52
Forum Newbie
Posts: 14
Joined: Fri Jul 09, 2010 1:58 pm

Re: Character sets with SimpleXML

Post by swan52 »

Finally solved it. It's been bugging me for days now. I thought it would be simple, but I could not find any documentation on this...

And it was that simple. In case anyone is interested, here's the erm.. fix:

Code: Select all

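<?php

// Tell the browser the output is UTF-8; this must be sent before any output.
header('Content-Type: text/html; charset=utf-8');

$xml = simplexml_load_file("test.xml");

// Print each value twice: first as hex bytes, then as text.
echo bin2hex((string) $xml->Placemark[0]->ctr[0])."<br>";
echo $xml->Placemark[0]->ctr[0]."<br>";
echo bin2hex((string) $xml->Placemark[0]->ctdtl[0])."<br>";
echo $xml->Placemark[0]->ctdtl[0]."<br>";

?>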
:banghead:
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Re: Character sets with SimpleXML

Post by Apollo »

swan52 wrote:
Apollo wrote:What results do you get from this php?
204dc3bc6e6368656e
München
526f6472c3ad6775657a
Rodríguez
Well, then either SimpleXML or your .xml file is still messed up. Earlier on you mentioned that bin2hex output '4dfc6e6368656e' for München; note the fc for 'ü' (this is iso-8859-1). Now the 'ü' is binary c3bc (which is utf-8). So from an .xml file that claims to be iso-8859-1, you're getting utf-8 encoded data, which is not correct.

But if this behavior is consistent, you can indeed work around it the way you did. As long as you can assume that the result you get from SimpleXML is utf-8 even though the xml is iso-8859-1 (which is strange, but apparently this is the case), you can just output a utf-8 HTML header (as you did in your last post) and output the strings directly.
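
On the utf-8 bytes discussed above: as far as I know this is expected rather than a fault. SimpleXML is built on libxml2, which hands strings back as UTF-8 whatever encoding the file declares. The sketch below shows this with an inline ISO-8859-1 document built from byte escapes (a stand-in for test.xml, not the original file), and converts back only for the case where the rest of the application must stay ISO-8859-1. It assumes the mbstring extension is loaded.

```php
<?php

// An ISO-8859-1 document built from byte escapes: \xFC is ü in that encoding.
$xmlBytes = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
          . "<file><Placemark><ctr>M\xFCnchen</ctr></Placemark></file>";

$xml = simplexml_load_string($xmlBytes);
if ($xml === false) {
    exit("Could not parse the XML\n");
}

// Cast to string to get the element's text content.
$city = (string) $xml->Placemark[0]->ctr;

echo bin2hex($city), "\n"; // 4dc3bc6e6368656e (UTF-8, whatever the file declared)

// Only needed if the rest of the application really must stay ISO-8859-1.
$latin1 = mb_convert_encoding($city, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1), "\n"; // 4dfc6e6368656e
```

Keeping everything in utf-8 (page header, database connection, tables) is usually the simpler route than converting back.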
swan52
Forum Newbie
Posts: 14
Joined: Fri Jul 09, 2010 1:58 pm

Re: Character sets with SimpleXML

Post by swan52 »

Apollo wrote:
swan52 wrote:
Apollo wrote:What results do you get from this php?
204dc3bc6e6368656e
München
526f6472c3ad6775657a
Rodríguez
Well, then either SimpleXML or your .xml file is still messed up. Earlier on you mentioned that bin2hex output '4dfc6e6368656e' for München; note the fc for 'ü' (this is iso-8859-1). Now the 'ü' is binary c3bc (which is utf-8). So from an .xml file that claims to be iso-8859-1, you're getting utf-8 encoded data, which is not correct.

But if this behavior is consistent, you can indeed work around it the way you did. As long as you can assume that the result you get from SimpleXML is utf-8 even though the xml is iso-8859-1 (which is strange, but apparently this is the case), you can just output a utf-8 HTML header (as you did in your last post) and output the strings directly.
My apologies, I got the first Munich instance from a different XML file I was no longer using; that was a screw-up on my part. As far as I can see, everything now displays as it should when I use the utf-8 charset. I was just previously unaware of how to implement it. Thanks so much for helping.