Character Entities/XML Parsing -- Please help.

opengavel
Forum Newbie
Posts: 10
Joined: Tue Sep 12, 2006 2:06 pm
Location: Chicago

Post by opengavel »

feyd | Please use code and [syntax="..."] tags where appropriate when posting code. Your post has been edited to reflect how we'd like it posted. Please read: Posting Code in the Forums (http://forums.devnetwork.net/viewtopic.php?t=21171) to learn how to do it too.


I am having a problem with the "&amp;" character entity when parsing an XML document with PHP.

I want the script to parse a document for two items, the case id and the short case name, so they can be loaded into a MySQL database.

The problem is that the document title is the name of a lawsuit, so it may (or may not) include the &amp; character entity. When the parser encounters this entity, the title comes back split into several pieces instead of one string. There may be an obvious solution, as I am not that familiar with either character encoding or XML parsing.

Here is the code (I left out the startTag and endTag functions, but they really don't do anything):

Code: Select all

function fetchData($file) {
  // function to find needed XML data for index update
  echo "<br /><font color='salmon'>Locating case data for $file</font>";
  $xmlparser = xml_parser_create();
  xml_set_element_handler($xmlparser, "startTag", "endTag");
  xml_set_character_data_handler($xmlparser, "getcontents");
  xml_parser_set_option($xmlparser, XML_OPTION_CASE_FOLDING, false); 
  global $CURRENT, $casedata;
  $CURRENT = ""; // reset the shared element name that getcontents() reads
  if (!($fp = fopen($file, "r"))) {
    die("failed to open $file");
  } else {
    while ($data = fread($fp, 4096)){
      $data = preg_replace('/>\s+</', '><', $data); // drop whitespace between adjacent tags
      if (!xml_parse($xmlparser, $data, feof($fp))) {
        // report both the error message and the line it occurred on
        $reason = xml_error_string(xml_get_error_code($xmlparser));
        $line = xml_get_current_line_number($xmlparser);
        die("$reason at line $line");
      } 
    }
    fclose($fp);
  }
  xml_parser_free($xmlparser);
  return ($casedata);
}// End function

function getcontents($parser, $data){
  global $CURRENT;
  switch ($CURRENT) {
  case "id":
    echo "<br />id is: ($data)";
    break;
  case "short":
    echo "<br />name is: ($data)<br />";
    break;
  }
}// End function


The result I get is something like:

Code: Select all

Locating case data for opinions/1978/F/006/1978-F006-04070001.xml
id is: (1978-F006-04070001)
name is: (A )

name is: (&)

name is: ( M Records, Inc. v. M.V.C. Distrib. Corp.)

The XML has "short" tags wrapped around the whole of "A & M Records, Inc. v. M.V.C. Distrib. Corp.", yet the parser seems to act as if the character entity has its own set of "short" tags around it.

The XML:

Code: Select all

<short>A &amp; M Records, Inc. v. M.V.C. Distrib. Corp.</short>
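
To isolate the behavior, here is a stripped-down sketch that feeds just that one element to the parser and records every call to the character data handler. The names $chunks and $recordChunk are made up for this example, and the number of pieces may differ between PHP builds, so no exact output is shown.

```php
<?php
// Hypothetical names: $chunks and $recordChunk exist only in this sketch.
$chunks = [];

// Character data handler: store whatever the parser hands over on each call.
$recordChunk = function ($parser, $data) use (&$chunks) {
    $chunks[] = $data;
};

$xml = '<short>A &amp; M Records, Inc. v. M.V.C. Distrib. Corp.</short>';

$parser = xml_parser_create();
xml_set_character_data_handler($parser, $recordChunk);
xml_parse($parser, $xml, true);

// Print one line per handler call to see how the text was delivered.
foreach ($chunks as $i => $chunk) {
    echo "call $i: ($chunk)\n";
}
```

However many calls show up, joining the pieces in order gives back the full case name.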


Post by opengavel »

RESOLVED:

I couldn't figure out why this was happening, but I found other people with the same problem and a solution elsewhere on the Internet.

If the character data handler appends the tag data to a variable with the ".=" operator instead of printing it on every call, the string is put back together. I then ran htmlspecialchars() on the result to change the & back into the HTML entity (&amp;). I still think it's awfully strange parser behavior, but at least the fix is simple.
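
For anyone who finds this later, here is a minimal sketch of that fix. The start and end tag handlers were never posted, so the ones here are only a guess at their shape, and $current, $buffer and $titles are names invented for the example. The idea is to append in the character data handler and only use the text once the end tag arrives.

```php
<?php
// Sketch only: $current, $buffer and $titles are hypothetical names.
$current = '';   // name of the element currently open
$buffer  = '';   // text collected so far for that element
$titles  = [];   // completed <short> values

$startTag = function ($parser, $name, $attrs) use (&$current, &$buffer) {
    $current = $name;
    $buffer  = '';            // start a fresh buffer for each element
};

$getContents = function ($parser, $data) use (&$current, &$buffer) {
    if ($current === 'short') {
        $buffer .= $data;     // append, because the text may arrive in pieces
    }
};

$endTag = function ($parser, $name) use (&$current, &$buffer, &$titles) {
    if ($name === 'short') {
        $titles[] = $buffer;  // the element is complete only at its end tag
    }
    $current = '';
};

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($parser, $startTag, $endTag);
xml_set_character_data_handler($parser, $getContents);
xml_parse($parser, '<short>A &amp; M Records, Inc. v. M.V.C. Distrib. Corp.</short>', true);

// Re-encode the ampersand for HTML output.
echo htmlspecialchars($titles[0]), "\n";
// prints: A &amp; M Records, Inc. v. M.V.C. Distrib. Corp.
```

Resetting the buffer in the start tag handler is enough for flat elements like these; nested elements with mixed text would need a stack of buffers instead.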

What a headache.