I have what I think is a properly formatted XML file (I've removed the bulk of the text for this example):
[text]<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title> </title>
<meta content="http://www.w3.org/1999/xhtml; charset=utf-8" http-equiv="Content-Type"/>
<link href="stylesheet.css" type="text/css" rel="stylesheet"/>
<style type="text/css">
@page { margin-bottom: 5.000000pt; margin-top: 5.000000pt; }</style></head>
<body class="calibre">
<h1 class="title"><span id="anchor6" class="S-T4">ABOUT THIS BOOK</span></h1>
<p class="P-Standard">This book is intended to provide the reader with </p>
<h2 class="title"><span id="anchor7" class="S-T11">GPS Waypoints and Depth</span></h2>
<p class="P-Standard">GPS Waypoints are given in World Geodetic System 1984 (WGS84) i.</p>
<h2 class="title"><span id="anchor8" class="S-T11">Internet</span></h2>
<p class="P-Standard">If you have Internet access, you c</p>
<h2 class="title"><span id="anchor9" class="S-T11">Reporting Information</span></h2>
<p class="P-Standard">If you come across new i:</p>
<p class="P-P8"><span class="S-T11">1. </span>Your Name and Date of Observation</p>
<p class="P-P8"><span class="S-T11">2. </span>Detailed Description</p>
</body></html>[/text]
If I load it into a DOMDocument using either load($filename) or loadXML($string_contents), I have trouble querying it with XPath. For example, query("//p") returns no nodes. If I load it with loadHTML or loadHTMLFile, the same query("//p") works fine.
Are XPath queries different for XML, or is something else going on with the DOM structure?
XPATH question load, loadXML, loadHTML
Re: XPATH question load, loadXML, loadHTML
The HTML parser that PHP uses with loadHTML is more forgiving than the XML parser. By nature, HTML is more permissive than XML, and in practice browsers are very good at dealing with invalid markup. I'd wager that the vast majority of HTML served on the Internet is not valid XML, and much of it isn't valid HTML either.
Your XPath queries should work just fine with loadHTML so there is probably no reason to use loadXML.
BUT NOTE: if you don't carefully control the incoming HTML, you may want to use HTML Tidy to clean up invalid markup and ensure that loadHTML will not choke.
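If you go the Tidy route, a minimal sketch might look like the following. It assumes the optional tidy PHP extension is installed and simply hands the markup back untouched when it is not. The sample string $dirty and the helper cleanMarkup() are made up for illustration and are not part of any library.

```php
<?php
// Hypothetical sample of sloppy markup: unclosed <p> and <b> tags.
$dirty = '<html><body><p>First paragraph<p>Second <b>paragraph</body></html>';

// Hypothetical helper: repair markup with Tidy when the extension is available.
function cleanMarkup(string $markup): string
{
    if (!function_exists('tidy_repair_string')) {
        return $markup; // tidy extension missing: return the input unchanged
    }
    // Ask Tidy for XHTML output and disable line wrapping.
    $config = array('output-xhtml' => true, 'wrap' => 0);
    $clean = tidy_repair_string($markup, $config, 'utf8');
    return $clean === false ? $markup : $clean;
}

libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$doc = new DOMDocument();
$doc->loadHTML(cleanMarkup($dirty));

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->query('//p');
echo $paragraphs->length . " paragraph(s) found\n";
```

Because the cleaned markup is still loaded with loadHTML, the plain //p query keeps working on the result.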
Re: XPATH question load, loadXML, loadHTML
HTML Tidy is a good idea. This XML is from calibre's ebook-convert tool and the markup is pretty clean. If I dump the ->save() output in both the load() and loadHTML() cases, the data appears to have loaded fine. In the loadXML and load cases, if I dump out all the nodes (query("//*")), everything appears properly parsed. But the query("//p") always comes up blank when the file is loaded as XML.
Is there something different with XPath in the two cases? The test code further down outputs very different node paths for the two loaders:
[text]LOADHTML TEST
ALL NODES
/html>
/html/head>
/html/head/title>
/html/head/meta>
/html/head/link>
/html/head/style>
/html/body class=calibre>
/html/body/h1 class=title>
/html/body/h1/span class=S-T4 id=anchor6>
/html/body/p[1] class=P-Standard>
/html/body/h2[1] class=title>
/html/body/h2[1]/span class=S-T11 id=anchor7>
/html/body/p[2] class=P-Standard>
/html/body/h2[2] class=title>
/html/body/h2[2]/span class=S-T11 id=anchor8>
/html/body/p[3] class=P-Standard>
/html/body/h2[3] class=title>
/html/body/h2[3]/span class=S-T11 id=anchor9>
/html/body/p[4] class=P-Standard>
/html/body/p[5] class=P-P8>
/html/body/p[5]/span class=S-T11>
/html/body/p[6] class=P-P8>
/html/body/p[6]/span class=S-T11>
LOADHTML TEST
P NODES
/html/body/p[1] class=P-Standard>
/html/body/p[2] class=P-Standard>
/html/body/p[3] class=P-Standard>
/html/body/p[4] class=P-Standard>
/html/body/p[5] class=P-P8>
/html/body/p[6] class=P-P8>
LOADXML TEST
ALL NODES
/*>
/*/*[1]>
/*/*[1]/*[1]>
/*/*[1]/*[2]>
/*/*[1]/*[3]>
/*/*[1]/*[4]>
/*/*[2] class=calibre>
/*/*[2]/*[1] class=title>
/*/*[2]/*[1]/* class=S-T4 id=anchor6>
/*/*[2]/*[2] class=P-Standard>
/*/*[2]/*[3] class=title>
/*/*[2]/*[3]/* class=S-T11 id=anchor7>
/*/*[2]/*[4] class=P-Standard>
/*/*[2]/*[5] class=title>
/*/*[2]/*[5]/* class=S-T11 id=anchor8>
/*/*[2]/*[6] class=P-Standard>
/*/*[2]/*[7] class=title>
/*/*[2]/*[7]/* class=S-T11 id=anchor9>
/*/*[2]/*[8] class=P-Standard>
/*/*[2]/*[9] class=P-P8>
/*/*[2]/*[9]/* class=S-T11>
/*/*[2]/*[10] class=P-P8>
/*/*[2]/*[10]/* class=S-T11>
LOADXML TEST
P NODES
[/text]
Code: Select all
<?php
$input=<<<TEXT
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title> </title>
<meta content="http://www.w3.org/1999/xhtml; charset=utf-8" http-equiv="Content-Type"/>
<link href="stylesheet.css" type="text/css" rel="stylesheet"/>
<style type="text/css">
@page { margin-bottom: 5.000000pt; margin-top: 5.000000pt; }</style></head>
<body class="calibre">
<h1 class="title"><span id="anchor6" class="S-T4">ABOUT THIS BOOK</span></h1>
<p class="P-Standard">This book is intended to provide the reader with </p>
<h2 class="title"><span id="anchor7" class="S-T11">GPS Waypoints and Depth</span></h2>
<p class="P-Standard">GPS Waypoints are given in World Geodetic System 1984 (WGS84) i.</p>
<h2 class="title"><span id="anchor8" class="S-T11">Internet</span></h2>
<p class="P-Standard">If you have Internet access, you c</p>
<h2 class="title"><span id="anchor9" class="S-T11">Reporting Information</span></h2>
<p class="P-Standard">If you come across new i:</p>
<p class="P-P8"><span class="S-T11">1. </span>Your Name and Date of Observation</p>
<p class="P-P8"><span class="S-T11">2. </span>Detailed Description</p>
</body></html>
TEXT;
// First pass: parse the input with the lenient HTML parser.
$html = new DOMDocument();
$html->loadHTML($input);
$xpath = new DOMXPath($html);
$elements = array("class", "id"); // attributes printed next to each node path

// Dump every element node.
$nodelist = $xpath->query("//*");
echo "LOADHTML TEST\nALL NODES\n";
foreach ($nodelist as $n) {
    echo $n->getNodePath();
    foreach ($elements as $element) {
        if ($n->getAttribute($element) != "") {
            echo " $element=" . $n->getAttribute($element);
        }
    }
    echo ">\n";
}
echo "\n";

// Dump only the p elements.
$nodelist = $xpath->query("//p");
echo "LOADHTML TEST\nP NODES\n";
foreach ($nodelist as $n) {
    echo $n->getNodePath();
    foreach ($elements as $element) {
        if ($n->getAttribute($element) != "") {
            echo " $element=" . $n->getAttribute($element);
        }
    }
    echo ">\n";
}
echo "\n";

// Second pass: parse the same input with the strict XML parser.
$html->loadXML($input);
$xpath = new DOMXPath($html);
$elements = array("class", "id");

// Dump every element node.
$nodelist = $xpath->query("//*");
echo "LOADXML TEST\nALL NODES\n";
foreach ($nodelist as $n) {
    echo $n->getNodePath();
    foreach ($elements as $element) {
        if ($n->getAttribute($element) != "") {
            echo " $element=" . $n->getAttribute($element);
        }
    }
    echo ">\n";
}
echo "\n";

// Dump only the p elements (this is the query that comes back empty).
$nodelist = $xpath->query("//p");
echo "LOADXML TEST\nP NODES\n";
foreach ($nodelist as $n) {
    echo $n->getNodePath();
    foreach ($elements as $element) {
        if ($n->getAttribute($element) != "") {
            echo " $element=" . $n->getAttribute($element);
        }
    }
    echo ">\n";
}
echo "\n";
?>
Re: XPATH question load, loadXML, loadHTML
[quote]But the query("//p") always comes up blank when the file is loaded as XML.[/quote]
That's because you're querying for non-namespaced P elements, while in the document you have xhtml:p elements (assuming the prefix xhtml is mapped to http://www.w3.org/1999/xhtml).
With loadHTML(), DOMDocument ignores any namespaces, so the xhtml:p elements become unqualified p elements. That's why your query works in that case.
For xml you want something like this:
Code: Select all
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
$nodelist = $xpath->query("//xhtml:p");
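To see the difference end to end, here is a minimal self-contained sketch. It uses a cut-down XHTML string rather than the original file, and countNodes() is a hypothetical helper introduced only for this example. The prefix xhtml is arbitrary; only the URI has to match the document's xmlns declaration. The local-name() query is included as a prefix-free alternative.

```php
<?php
// Cut-down stand-in for the calibre output; not the original file.
$input = <<<XML
<html xmlns="http://www.w3.org/1999/xhtml">
<body><p class="P-Standard">one</p><p class="P-P8">two</p></body>
</html>
XML;

// Hypothetical helper: run an XPath query and return the number of matches.
function countNodes(DOMXPath $xpath, string $query): int
{
    return $xpath->query($query)->length;
}

libxml_use_internal_errors(true); // collect parser warnings instead of printing them

// XML parser: elements keep the default namespace declared by xmlns.
$xml = new DOMDocument();
$xml->loadXML($input);
$xmlPath = new DOMXPath($xml);
$xmlPath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

// HTML parser: elements end up without a namespace.
$html = new DOMDocument();
$html->loadHTML($input);
$htmlPath = new DOMXPath($html);

$results = array(
    'xml //p'        => countNodes($xmlPath, '//p'),
    'xml //xhtml:p'  => countNodes($xmlPath, '//xhtml:p'),
    'xml local-name' => countNodes($xmlPath, "//*[local-name()='p']"),
    'html //p'       => countNodes($htmlPath, '//p'),
);
foreach ($results as $label => $count) {
    echo "$label: $count\n";
}
```

Only the un-prefixed //p against the XML-loaded document should come back empty; the other three queries should each find both paragraphs.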
Re: XPATH question load, loadXML, loadHTML
[quote]Weirdan wrote: ...because you're querying for non-namespaced P elements, while in the document you have xhtml:p elements...[/quote]
Awesome bit of insight! Thanks for sharing.
Re: XPATH question load, loadXML, loadHTML
Thanks, Weirdan! I didn't know why the namespaces were so different.