Can someone tell me how to get started with parsing an HTML document using one of the PHP XML functions, i.e. which of the many PHP XML facilities can I use and how do I use it?
I have a file containing an HTML List that contains sublists, and I was wondering whether I could parse it like an XML file (which I'm sure you can) so I can determine the relationships between the lists.
For example if I know what the content is of one of the <li> tags in the nested list, I want to work out what the content of the parent <li> tag was that contains the <ul> nested list.
I'm not too hot on navigating through XML at the moment, so this is really just an experimental exercise to form a breadcrumb from a hardcoded CSS menu. Usually these menus are formed from a database, so I can use the database to generate the breadcrumb, but I thought I'd take the chance to learn something new, and I'm not saying it's the best way of doing things.
I'm searching for tutorials and articles as we speak, but some starter code, or a pointer to the relevant functions on PHP.net, would be helpful. There seems to be a lot of choice in the manual and I don't know where to start!
[solved] Parsing an HTML file with PHP XML functions
- Skittlewidth
- Forum Contributor
- Posts: 389
- Joined: Wed Nov 06, 2002 9:18 am
- Location: Kent, UK
Last edited by Skittlewidth on Thu Dec 14, 2006 6:35 am, edited 1 time in total.
With the DOM extension of PHP 5 I can, e.g., do
Code: Select all
$doc = new DOMDocument();
$doc->loadHTMLFile('http://www.php.net'); // instance call; the static form fails on PHP 8
$xpath = new DOMXPath($doc);
$nodeset = $xpath->query('//div[@id="leftbar"]//ul[position()=1]//li');
foreach ($nodeset as $node) {
    echo $node->textContent, "<br />\n";
}
and get all the <li> entries of the list on http://www.php.net
Thanks To
* easyDNS
* Directi
* pair Networks
* EV1Servers
* Server Central
* Hosted Solutions
* Spry VPS Hosting
* eZ systems / HiT
* OSU Open Source Lab
* Emini A/S
* Yahoo! Inc.
Take a look at http://en.wikipedia.org/wiki/XML and http://en.wikipedia.org/wiki/XPath esp. the links to tutorials
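To experiment without fetching a remote page, the same idea can be tried on an HTML string. This is only a minimal sketch, assuming PHP with the DOM extension enabled; the sample markup is made up for illustration and is not the real php.net page.

```php
<?php
// Hypothetical sample markup standing in for a real page.
$html = '<div id="leftbar">'
      . '<ul><li>One</li><li>Two</li></ul>'
      . '<ul><li>Other</li></ul>'
      . '</div>';

$doc = new DOMDocument();
// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Every <li> inside a <ul> that is the first <ul> child of its parent,
// somewhere below the div with id "leftbar".
$items = array();
foreach ($xpath->query('//div[@id="leftbar"]//ul[position()=1]//li') as $node) {
    $items[] = $node->textContent;
}
print_r($items);
```

loadHTML() takes the markup as a string, while loadHTMLFile() takes a path or URL; the XPath part is the same either way.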
- Skittlewidth
- Forum Contributor
- Posts: 389
- Joined: Wed Nov 06, 2002 9:18 am
- Location: Kent, UK
For anyone interested, the following markup and code successfully identify "Contact Us" as the parent of the "Key Personnel" link on the menu (i.e. "Key Personnel" is on a flyout from "Contact Us"):
[syntax="html"]
<span id="test">
<ul>
<li><a href="index.php" title="Process Systems">Process Systems</a>
<ul>
<li><a href="intake_storage" title="Intake & Storage">Intake & Storage</a></li>
<li><a href="mixing_blending" title="Mixing & Blending">Mixing & Blending</a></li>
<li><a href="heat_treatment" title="Heat Treatment / Pasteurisation">Heat Treatment / Pasteurisation</a></li>
<li><a href="separation_homogenisation" title="Separation & Homogenisation">Separation & Homogenisation</a></li>
<li style="border-bottom:1px solid #016867"><a href="cleaning_in_place" title="Cleaning in Place">Cleaning in Place</a></li>
</ul>
</li>
<li><a href="links" title="Links">Links</a></li>
<li><a href="contact_us" title="Contact Us">Contact Us</a>
<ul>
<li><a href="key_personnel" title="Key Personnel">Key Personnel</a></li>
<li style="border-bottom:1px solid #016867"><a href="location" title="Location Map">Location Map</a></li>
</ul>
</li>
</ul>
</span>
[/syntax]
Code: Select all
$doc = new DOMDocument();
$doc->loadHTMLFile('includes/menu.inc.php'); // instance call; the static form fails on PHP 8
$xpath = new DOMXPath($doc);
$nodeset = $xpath->query('//span[@id="test"]//li[a="Key Personnel"]');
foreach ($nodeset as $node) {
    $parent = $node->parentNode->parentNode; // <ul> is the next node up; we want the <li> that contains it.
    echo "<pre>" . $parent->firstChild->textContent . "</pre>";
}
I think I could avoid the double parentNode call if I modified the XPath to go for the first parent node, but I was struggling with the syntax and it kept returning more data than I was after.
Any modification suggestions are welcome!
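One possible modification, offered as a sketch rather than the definitive answer: XPath's ancestor axis can select the nearest enclosing <li> directly, so no parentNode calls are needed. The snippet uses a trimmed-down copy of the menu held in a string (an assumption, so that it is self-contained); `ancestor::li[1]` picks the closest <li> ancestor and `/a` its link.

```php
<?php
// Trimmed-down copy of the menu, inlined for illustration only.
$html = '<span id="test"><ul>'
      . '<li><a href="links">Links</a></li>'
      . '<li><a href="contact_us">Contact Us</a><ul>'
      . '<li><a href="key_personnel">Key Personnel</a></li>'
      . '<li><a href="location">Location Map</a></li>'
      . '</ul></li>'
      . '</ul></span>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// From the matching <li>, step to its nearest <li> ancestor, then to that item's link.
$query = '//span[@id="test"]//li[a="Key Personnel"]/ancestor::li[1]/a';
$parents = array();
foreach ($xpath->query($query) as $a) {
    $parents[] = $a->textContent;
}
print_r($parents);
```

An equivalent form is `/../..` in place of `/ancestor::li[1]`, which mirrors the two parentNode steps; the ancestor axis has the advantage of still working if extra wrapper elements are added between the lists.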
feyd | Please use [php], [code] and [syntax="..."] tags where appropriate when posting code. Your post has been edited to reflect how we'd like it posted. Please read: Posting Code in the Forums (http://forums.devnetwork.net/viewtopic.php?t=21171) to learn how to do it too.
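Since the stated goal was a breadcrumb, the same axis idea might be extended with `ancestor-or-self::li`, which returns the matching item together with all of its enclosing items in document order. The helper name `breadcrumb()` and the inlined menu are hypothetical, and the label is assumed to contain no double quotes.

```php
<?php
// Hypothetical helper: link texts from the top-level item down to $label.
function breadcrumb(DOMXPath $xpath, $label)
{
    // Naive quoting: assumes $label contains no double quotes.
    $query = '//li[a="' . $label . '"]/ancestor-or-self::li/a';
    $trail = array();
    foreach ($xpath->query($query) as $a) {
        $trail[] = $a->textContent;
    }
    return $trail;
}

// Trimmed-down copy of the menu, inlined for illustration only.
$html = '<span id="test"><ul>'
      . '<li><a href="links">Links</a></li>'
      . '<li><a href="contact_us">Contact Us</a><ul>'
      . '<li><a href="key_personnel">Key Personnel</a></li>'
      . '</ul></li>'
      . '</ul></span>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$trail = breadcrumb($xpath, 'Key Personnel');
echo implode(' > ', $trail), "\n";
```

Looking items up by their href attribute instead of the link text (for example `li[a/@href="key_personnel"]`) may be more robust if the current page is known by its URL.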