[solved] Parsing an HTML file with PHP XML functions

Skittlewidth
Forum Contributor
Posts: 389
Joined: Wed Nov 06, 2002 9:18 am
Location: Kent, UK

Post by Skittlewidth »

Can someone tell me how to get started with parsing an HTML document using one of the PHP XML extensions? Which of the many PHP XML facilities can I use, and how do I use it?

I have a file containing an HTML list with nested sublists, and I was wondering whether I could parse it like an XML file (which I'm sure is possible) so I can determine the relationships between the lists.

For example, if I know the content of one of the <li> tags in a nested list, I want to work out the content of the parent <li> tag that contains the nested <ul>.

I'm not too hot on navigating through XML at the moment, so this is really just an experimental exercise: forming a breadcrumb from a hardcoded CSS menu. Usually these menus are built from a database, so I can use the database to generate the breadcrumb, but I thought I'd take the chance to learn something new. I'm not saying it's the best way of doing things.

I'm searching for tutorials and articles as we speak, but some starter code or a pointer to the relevant functions on PHP.net would be helpful. There seems to be a lot of choice in the manual and I don't know where to start!
Last edited by Skittlewidth on Thu Dec 14, 2006 6:35 am, edited 1 time in total.
volka
DevNet Evangelist
Posts: 8391
Joined: Tue May 07, 2002 9:48 am
Location: Berlin, ger

Post by volka »

With the DOM extension of PHP 5 I can, for example, do

Code: Select all

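$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect parser warnings quietly
// loadHTMLFile() is an instance method; calling it statically fails on PHP 8
$doc->loadHTMLFile('http://www.php.net');
$xpath = new DOMXPath($doc);
// inside div#leftbar: every <li> below a <ul> that is the first <ul> child of its parent
$nodeset = $xpath->query('//div[@id="leftbar"]//ul[position()=1]//li');
foreach ($nodeset as $node) {
	echo $node->textContent, "<br />\n";
}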
and get all the <li> entries of the "Thanks To" list on http://www.php.net:

* easyDNS
* Directi
* pair Networks
* EV1Servers
* Server Central
* Hosted Solutions
* Spry VPS Hosting
* eZ systems / HiT
* OSU Open Source Lab
* Emini A/S
* Yahoo! Inc.

Take a look at http://en.wikipedia.org/wiki/XML and http://en.wikipedia.org/wiki/XPath, especially the links to tutorials.
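
For anyone who wants to try the same idea without depending on the live markup of a remote page, here is a minimal sketch that runs the query against an inline string. The markup, the `leftbar` id and the list items are made up for illustration; only `DOMDocument`, `DOMXPath` and the libxml error functions come from PHP itself.

```php
<?php
// Hypothetical markup standing in for a fetched page; ids and items are invented.
$html = <<<HTML
<div id="leftbar">
  <ul>
    <li>First</li>
    <li>Second</li>
  </ul>
  <ul>
    <li>Other list</li>
  </ul>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// ul[1] is shorthand for ul[position()=1]: the first <ul> child of the div.
$items = [];
foreach ($xpath->query('//div[@id="leftbar"]/ul[1]/li') as $li) {
    $items[] = trim($li->textContent);
}

print_r($items); // First, Second
```

Using `/ul[1]/li` rather than `//ul[position()=1]//li` keeps the match to the direct children of the first list, which is usually what you want once lists start nesting.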

Post by Skittlewidth »

Cool, I was playing around with something along those lines but wasn't quite doing it right. I'll take a look at that sample code.

Post by Skittlewidth »

feyd | Please use [code], [php] and [syntax="..."] tags where appropriate when posting code. Your post has been edited to reflect how we'd like it posted. Please read: Posting Code in the Forums (http://forums.devnetwork.net/viewtopic.php?t=21171) to learn how to do it too.


For anyone interested:
[syntax="html"]
<span id="test">
  <ul>
    <li><a href="index.php" title="Process Systems">Process Systems</a>
      <ul>
        <li><a href="intake_storage" title="Intake & Storage">Intake & Storage</a></li>
        <li><a href="mixing_blending" title="Mixing & Blending">Mixing & Blending</a></li>
        <li><a href="heat_treatment" title="Heat Treatment / Pasteurisation">Heat Treatment / Pasteurisation</a></li>
        <li><a href="separation_homogenisation" title="Separation & Homogenisation">Separation & Homogenisation</a></li>
        <li style="border-bottom:1px solid #016867"><a href="cleaning_in_place" title="Cleaning in Place">Cleaning in Place</a></li>
      </ul>
    </li>
    <li><a href="links" title="Links">Links</a></li>
    <li><a href="contact_us" title="Contact Us">Contact Us</a>
      <ul>
        <li><a href="key_personnel" title="Key Personnel">Key Personnel</a></li>
        <li style="border-bottom:1px solid #016867"><a href="location" title="Location Map">Location Map</a></li>
      </ul>
    </li>
  </ul>
</span>
[/syntax]

Code: Select all

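$doc = new DOMDocument();
libxml_use_internal_errors(true); // the menu is a fragment, not a full page; collect parser warnings quietly
$doc->loadHTMLFile('includes/menu.inc.php'); // instance call; the static form fails on PHP 8
$xpath = new DOMXPath($doc);

// the <li> whose direct <a> child has the text "Key Personnel"
$nodeset = $xpath->query('//span[@id="test"]//li[a="Key Personnel"]');
foreach ($nodeset as $node) {
	$parent = $node->parentNode->parentNode; // <ul> is the next node up; we want the <li> that contains it
	// firstChild is the parent's <a>, because no whitespace separates <li> and <a> in the markup
	echo "<pre>" . $parent->firstChild->textContent . "</pre>";
}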
This successfully identifies "Contact Us" as the parent of the "Key Personnel" link on the menu. (i.e. "Key Personnel" is on a flyout from "Contact Us")
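
The same walk up the tree can be repeated until the document root is reached, which gives the whole breadcrumb rather than just one level. This is only a sketch under some assumptions: the markup is a trimmed-down, hypothetical copy of the menu above, the `menu` id is invented, and the first `<a>` inside each `<li>` is taken to be that item's own link.

```php
<?php
// Hypothetical, shortened copy of the menu; the "menu" id is invented.
$html = <<<HTML
<ul id="menu">
  <li><a href="links">Links</a></li>
  <li><a href="contact_us">Contact Us</a>
    <ul>
      <li><a href="key_personnel">Key Personnel</a></li>
    </ul>
  </li>
</ul>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$crumbs = [];
$node = $xpath->query('//ul[@id="menu"]//li[a="Key Personnel"]')->item(0);

// Climb towards the root, recording the link text of every <li> on the way.
while ($node !== null) {
    if ($node instanceof DOMElement && $node->nodeName === 'li') {
        // First <a> in document order is assumed to be this item's own link.
        $link = $node->getElementsByTagName('a')->item(0);
        if ($link !== null) {
            array_unshift($crumbs, trim($link->textContent));
        }
    }
    $node = $node->parentNode;
}

echo implode(' > ', $crumbs), "\n"; // Contact Us > Key Personnel
```

Because the loop only reacts to `<li>` elements, it does not matter how many `<ul>` wrappers sit between the levels.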

I think I could avoid the double parentNode call if I modified the XPath expression to select the parent <li> directly, but I was struggling with the syntax and it kept returning more data than I was after.
Any modification suggestions are welcome!

:)
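
One possible way to drop the double parentNode call is the XPath `ancestor` axis. The sketch below is an assumption-laden illustration rather than a tested drop-in for the real menu file: the markup and the `menu` id are hypothetical. Selecting the `<a>` child of the ancestor, instead of the ancestor `<li>` itself, is what keeps the nested list's text out of the result, which may be why the earlier attempts returned more data than expected.

```php
<?php
// Hypothetical one-branch menu; the "menu" id is invented for this example.
$html = '<ul id="menu"><li><a href="contact_us">Contact Us</a>'
      . '<ul><li><a href="key_personnel">Key Personnel</a></li>'
      . '<li><a href="location">Location Map</a></li></ul></li></ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$item = '//ul[@id="menu"]//li[a="Key Personnel"]';

// ancestor::li[1] is the nearest enclosing <li>; "/a" picks only its own link.
$parent = $xpath->query($item . '/ancestor::li[1]/a')->item(0);
$parentText = $parent !== null ? $parent->textContent : null;
echo $parentText, "\n"; // Contact Us

// ancestor-or-self returns every level, in document order: a ready-made trail.
$trail = [];
foreach ($xpath->query($item . '/ancestor-or-self::li/a') as $a) {
    $trail[] = $a->textContent;
}
echo implode(' > ', $trail), "\n"; // Contact Us > Key Personnel
```

For exactly two levels, `../../a` appended to the same expression should select the same link, but the `ancestor` form keeps working if the menu gains deeper nesting.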

