XPATH question load, loadXML, loadHTML

Eric!
DevNet Resident
Posts: 1146
Joined: Sun Jun 14, 2009 3:13 pm

XPATH question load, loadXML, loadHTML

Post by Eric! »

I have what I think is a properly formatted XML file (I've removed the bulk of the text for this example):
[text]<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title> </title>
<meta content="http://www.w3.org/1999/xhtml; charset=utf-8" http-equiv="Content-Type"/>
<link href="stylesheet.css" type="text/css" rel="stylesheet"/>
<style type="text/css">
@page { margin-bottom: 5.000000pt; margin-top: 5.000000pt; }</style></head>
<body class="calibre">
<h1 class="title"><span id="anchor6" class="S-T4">ABOUT THIS BOOK</span></h1>
<p class="P-Standard">This book is intended to provide the reader with </p>
<h2 class="title"><span id="anchor7" class="S-T11">GPS Waypoints and Depth</span></h2>
<p class="P-Standard">GPS Waypoints are given in World Geodetic System 1984 (WGS84) i.</p>
<h2 class="title"><span id="anchor8" class="S-T11">Internet</span></h2>
<p class="P-Standard">If you have Internet access, you c</p>
<h2 class="title"><span id="anchor9" class="S-T11">Reporting Information</span></h2>
<p class="P-Standard">If you come across new i:</p>
<p class="P-P8"><span class="S-T11">1. </span>Your Name and Date of Observation</p>
<p class="P-P8"><span class="S-T11">2. </span>Detailed Description</p>
</body></html>[/text]

If I load it into a DOMDocument using either load($filename) or loadXML($string_contents), I have trouble querying it with XPath. For example, query("//p") returns no nodes. If I load it with loadHTML() or loadHTMLFile(), the same query("//p") works fine.

Are XPath queries different for XML, or is something else going on with the DOM structure?
tr0gd0rr
Forum Contributor
Posts: 305
Joined: Thu May 11, 2006 8:58 pm
Location: Utah, USA

Re: XPATH question load, loadXML, loadHTML

Post by tr0gd0rr »

The HTML parser that PHP uses with loadHTML is more forgiving than the XML parser. By nature, HTML is more permissive than XML, and in practice browsers are very good at dealing with invalid markup. I'd wager that the vast majority of HTML served on the Internet is not valid XML, and much of it isn't valid HTML either.

Your XPath queries should work just fine with loadHTML, so there is probably no reason to use loadXML.

BUT NOTE: if you don't carefully control the incoming HTML, you may want to use HTML Tidy to clean up invalid markup and ensure that loadHTML will not choke.
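
A minimal sketch of that clean-up step, assuming the optional tidy extension is installed (the code skips the repair if it is not). The helper name loadUntrustedHtml() and the sample markup are made up for illustration:

```php
<?php
// Sketch: repair untrusted markup before handing it to DOMDocument.
// Assumes the optional tidy extension; without it the raw input is used.
// loadUntrustedHtml() is a hypothetical helper, not part of any library.

function loadUntrustedHtml(string $html): DOMDocument
{
    if (function_exists('tidy_repair_string')) {
        // Ask tidy to close unclosed tags and fix nesting.
        $repaired = tidy_repair_string($html, [], 'utf8');
        if ($repaired !== false) {
            $html = $repaired;
        }
    }

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Deliberately sloppy input: unclosed <p> and <b> tags.
$doc = loadUntrustedHtml('<p>one<p>two <b>bold');
$xpath = new DOMXPath($doc);
echo $xpath->query('//p')->length, "\n";
```

The libxml_use_internal_errors() calls only silence the warnings loadHTML raises on bad markup; they do not repair anything, which is why Tidy runs first when it is available.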
Eric!
DevNet Resident
Posts: 1146
Joined: Sun Jun 14, 2009 3:13 pm

Re: XPATH question load, loadXML, loadHTML

Post by Eric! »

HTML Tidy is a good idea. This XML comes from calibre's ebook-convert tool and the markup is pretty clean. If I dump the ->save() string out in both the load() and loadHTML() cases, it appears to have loaded the data fine. In the loadXML() and load() cases, if I dump out all the nodes (query("//*")), everything appears properly parsed. But query("//p") always comes up empty when the file is loaded as XML.

Is there something different about XPath in the two cases? Look at this test code:

<?php
$input=<<<TEXT
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title> </title>
    <meta content="http://www.w3.org/1999/xhtml; charset=utf-8" http-equiv="Content-Type"/>
  <link href="stylesheet.css" type="text/css" rel="stylesheet"/>
<style type="text/css">
                @page { margin-bottom: 5.000000pt; margin-top: 5.000000pt; }</style></head>
  <body class="calibre">
<h1 class="title"><span id="anchor6" class="S-T4">ABOUT THIS BOOK</span></h1>
<p class="P-Standard">This book is intended to provide the reader with </p>
<h2 class="title"><span id="anchor7" class="S-T11">GPS Waypoints and Depth</span></h2>
<p class="P-Standard">GPS Waypoints are given in World Geodetic System 1984 (WGS84) i.</p>
<h2 class="title"><span id="anchor8" class="S-T11">Internet</span></h2>
<p class="P-Standard">If you have Internet access, you c</p>
<h2 class="title"><span id="anchor9" class="S-T11">Reporting Information</span></h2>
<p class="P-Standard">If you come across new i:</p>
<p class="P-P8"><span class="S-T11">1.  </span>Your Name and Date of Observation</p>
<p class="P-P8"><span class="S-T11">2.  </span>Detailed Description</p>
</body></html>
TEXT;

    $html = new DOMDocument();
    $html->loadHTML($input);
    $xpath = new DOMXPath($html);
    $elements=array("class","id");

    $nodelist = $xpath->query("//*");
    echo "LOADHTML TEST\nALL NODES\n";
    foreach ($nodelist as $n) {
        echo $n->getNodePath();
        foreach ($elements as $element) {
            if ($n->getAttribute($element) != "")
                echo " $element=" . $n->getAttribute($element);
        }
        echo ">\n";
    }
    echo "\n";
    
    $nodelist = $xpath->query("//p");
    echo "LOADHTML TEST\nP NODES\n";
    foreach ($nodelist as $n) {
        echo $n->getNodePath();
        foreach ($elements as $element) {
            if ($n->getAttribute($element) != "")
                echo " $element=" . $n->getAttribute($element);
        }
        echo ">\n";
    }
    echo "\n";

    $html->loadXML($input);
    $xpath = new DOMXPath($html);
    $elements=array("class","id");

    $nodelist = $xpath->query("//*");
    echo "LOADXML TEST\nALL NODES\n";
    foreach ($nodelist as $n) {
        echo $n->getNodePath();
        foreach ($elements as $element) {
            if ($n->getAttribute($element) != "")
                echo " $element=" . $n->getAttribute($element);
        }
        echo ">\n";
    }
    echo "\n";
    
    $nodelist = $xpath->query("//p");
    echo "LOADXML TEST\nP NODES\n";
    foreach ($nodelist as $n) {
        echo $n->getNodePath();
        foreach ($elements as $element) {
            if ($n->getAttribute($element) != "")
                echo " $element=" . $n->getAttribute($element);
        }
        echo ">\n";
    }
    echo "\n";
?>
It outputs very different node paths:
[text]LOADHTML TEST
ALL NODES
/html>
/html/head>
/html/head/title>
/html/head/meta>
/html/head/link>
/html/head/style>
/html/body class=calibre>
/html/body/h1 class=title>
/html/body/h1/span class=S-T4 id=anchor6>
/html/body/p[1] class=P-Standard>
/html/body/h2[1] class=title>
/html/body/h2[1]/span class=S-T11 id=anchor7>
/html/body/p[2] class=P-Standard>
/html/body/h2[2] class=title>
/html/body/h2[2]/span class=S-T11 id=anchor8>
/html/body/p[3] class=P-Standard>
/html/body/h2[3] class=title>
/html/body/h2[3]/span class=S-T11 id=anchor9>
/html/body/p[4] class=P-Standard>
/html/body/p[5] class=P-P8>
/html/body/p[5]/span class=S-T11>
/html/body/p[6] class=P-P8>
/html/body/p[6]/span class=S-T11>

LOADHTML TEST
P NODES
/html/body/p[1] class=P-Standard>
/html/body/p[2] class=P-Standard>
/html/body/p[3] class=P-Standard>
/html/body/p[4] class=P-Standard>
/html/body/p[5] class=P-P8>
/html/body/p[6] class=P-P8>

LOADXML TEST
ALL NODES
/*>
/*/*[1]>
/*/*[1]/*[1]>
/*/*[1]/*[2]>
/*/*[1]/*[3]>
/*/*[1]/*[4]>
/*/*[2] class=calibre>
/*/*[2]/*[1] class=title>
/*/*[2]/*[1]/* class=S-T4 id=anchor6>
/*/*[2]/*[2] class=P-Standard>
/*/*[2]/*[3] class=title>
/*/*[2]/*[3]/* class=S-T11 id=anchor7>
/*/*[2]/*[4] class=P-Standard>
/*/*[2]/*[5] class=title>
/*/*[2]/*[5]/* class=S-T11 id=anchor8>
/*/*[2]/*[6] class=P-Standard>
/*/*[2]/*[7] class=title>
/*/*[2]/*[7]/* class=S-T11 id=anchor9>
/*/*[2]/*[8] class=P-Standard>
/*/*[2]/*[9] class=P-P8>
/*/*[2]/*[9]/* class=S-T11>
/*/*[2]/*[10] class=P-P8>
/*/*[2]/*[10]/* class=S-T11>

LOADXML TEST
P NODES

[/text]
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: XPATH question load, loadXML, loadHTML

Post by Weirdan »

Eric! wrote:But the query("//p") always comes up blank when the file is loaded as XML.
That's because you're querying for non-namespaced p elements, while the document contains xhtml:p elements (assuming the prefix xhtml is mapped to http://www.w3.org/1999/xhtml). In XPath 1.0 an unprefixed name in a query always means "no namespace"; the document's default xmlns declaration does not apply to the query.

With loadHTML(), DOMDocument ignores any namespaces, so xhtml:p elements become unqualified p elements. That's why your query works in that case.
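
One way to see the difference is to inspect namespaceURI on the same element under both loaders. This is only a sketch: the cut-down markup and the helper name describeFirstP() are invented for the example.

```php
<?php
// Sketch: compare the namespace of the same <p> element under both loaders.
// The sample markup and describeFirstP() are illustrative only.

$markup = <<<XML
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>t</title></head>
<body><p class="P-Standard">Hello</p></body>
</html>
XML;

function describeFirstP(DOMDocument $doc): string
{
    // getElementsByTagName() finds the element in both documents.
    $p = $doc->getElementsByTagName('p')->item(0);
    return $p->namespaceURI === null ? '(no namespace)' : $p->namespaceURI;
}

$asXml = new DOMDocument();
$asXml->loadXML($markup);

$asHtml = new DOMDocument();
libxml_use_internal_errors(true);   // keep HTML parser warnings quiet
$asHtml->loadHTML($markup);
libxml_clear_errors();

echo "loadXML:  ", describeFirstP($asXml), "\n";
echo "loadHTML: ", describeFirstP($asHtml), "\n";
```

Under loadXML() the element should report the XHTML namespace URI, while under loadHTML() it should report none, which matches the /*/*[2] versus /html/body paths in the output posted above.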

For XML, you want something like this:

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
$nodelist = $xpath->query("//xhtml:p");
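
For a self-contained version of the same idea, here is a rough sketch that runs the unprefixed query, the prefixed query, and a namespace-agnostic local-name() query against one small document. The markup and the helper countMatches() are assumptions made for the example; only registerNamespace() and query() come from the snippet above.

```php
<?php
// Sketch: three ways to query <p> in a namespaced XHTML document.
// countMatches() and the sample markup are illustrative, not from the thread.

$markup = <<<XML
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>t</title></head>
<body>
<p class="P-Standard">First</p>
<p class="P-P8">Second</p>
</body>
</html>
XML;

function countMatches(DOMXPath $xpath, string $expression): int
{
    $nodes = $xpath->query($expression);
    return $nodes === false ? 0 : $nodes->length;
}

$doc = new DOMDocument();
$doc->loadXML($markup);

$xpath = new DOMXPath($doc);
// The prefix is arbitrary; only the URI has to match the document.
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

echo countMatches($xpath, '//p'), "\n";                    // unprefixed: no namespace
echo countMatches($xpath, '//xhtml:p'), "\n";              // prefixed: XHTML namespace
echo countMatches($xpath, "//*[local-name()='p']"), "\n";  // ignores the namespace
```

The local-name() form avoids registering a prefix, but it also matches p elements from any other namespace, so the registered prefix is usually the more precise choice.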
tr0gd0rr
Forum Contributor
Posts: 305
Joined: Thu May 11, 2006 8:58 pm
Location: Utah, USA

Re: XPATH question load, loadXML, loadHTML

Post by tr0gd0rr »

Weirdan wrote:...because you're querying for non-namespaced P elements, while in the document you have xhtml:p elements...
Awesome bit of insight! Thanks for sharing.
Eric!
DevNet Resident
Posts: 1146
Joined: Sun Jun 14, 2009 3:13 pm

Re: XPATH question load, loadXML, loadHTML

Post by Eric! »

Thanks Weirdan! I didn't know why the namespaces were so different.